Cardiac Risk Prediction

Human Activity Recognition

Example: Cardiac Risk Prediction

Example: Cardiac Risk Prediction

Suppose our dataset contains the following attributes:

Age: <40, 40-55, >55
Smoker: Yes, No
Exercise: Regularly, Rarely

Our target variable is: Risk: High, Low

Cardiac Risk Prediction Dataset

Age	Smoker	Exercise	Risk
<40	No	Regularly	Low
<40	No	Rarely	Low
<40	No	Rarely	Low
<40	Yes	Regularly	Low
<40	Yes	Rarely	High
40-55	No	Regularly	Low
40-55	No	Rarely	High
40-55	Yes	Regularly	High
40-55	Yes	Rarely	High
>55	No	Regularly	Low
>55	Yes	Regularly	High
>55	Yes	Rarely	High
>55	No	Rarely	High

Initial Entropy Calculation

From the table:

$p_{\text{Low}} = \frac{6}{13}$
$ p_{\text{High}} = \frac{7}{13} $

Thus, the entropy for this set $ S $ is: $ \text{Entropy}(S) = -\frac{6}{13} \log_2(\frac{6}{13}) - \frac{7}{13} \log_2(\frac{7}{13}) \approx 1$

Information Gain Calculation

For Age

Age <40:
- Total: 5
- High Risk: 1
- Low Risk: 4
- $ \text{Entropy} = -\frac{1}{5} \log_2(\frac{1}{5}) - \frac{4}{5} \log_2(\frac{4}{5}) $ $ \approx 0.72 $
Age 40-55:
- Total: 4
- High Risk: 3
- Low Risk: 1
- $ \text{Entropy} = -\frac{3}{4} \log_2(\frac{3}{4}) - \frac{1}{4} \log_2(\frac{1}{4}) \approx 0.81$
Age >55:
- Total: 4
- High Risk: 3
- Low Risk: 1
- $ \text{Entropy} \approx 0.81 \text{(same as above)}$

Let us compute the weighted entropies and then the information gain for the Age attribute in the next step.

$ \text{Weighted Entropy for Age} = \frac{5}{13} \times 0.72 + \frac{4}{13} \times 0.81 + \frac{4}{13} \times 0.81 \approx 0.78$

$ \text{Information Gain for Age} = \text{Initial Entropy} - \text{Weighted Entropy for Age} \approx 1 - 0.78 \approx 0.22$

For Smoker

Smoker = Yes:
- Total: 6
- High Risk: 5
- Low Risk: 1
- $ \text{Entropy} = -\frac{5}{6} \log_2(\frac{5}{6}) - \frac{1}{6} \log_2(\frac{1}{6}) \approx 0.65 $
Smoker = No:
- Total: 7
- High Risk: 2
- Low Risk: 5
- $\text{Entropy} = -\frac{2}{7} \log_2(\frac{2}{7}) - \frac{5}{7} \log_2(\frac{5}{7}) \approx 0.86$

$ \text{Weighted Entropy for Smoker} \approx \frac{6}{13} \times 0.65 + \frac{7}{13} \times 0.86 \approx 0.76 $

$ \text{Information Gain for Smoker} = \text{Initial Entropy} - \text{Weighted Entropy for Smoker} \approx 1 - 0.76 \approx 0.24 $

For Exercise

Exercise = Regularly:
- Total: 6
- High Risk: 2
- Low Risk: 4
- $ \text{Entropy} = -\frac{4}{6} \log_2(\frac{4}{6}) - \frac{2}{6} \log_2(\frac{2}{6}) = 0.92 $
Exercise = Rarely:
- Total: 7
- High Risk: 5
- Low Risk: 2
- $ \text{Entropy} = -\frac{2}{7} \log_2(\frac{2}{7}) - \frac{5}{7} \log_2(\frac{5}{7}) = 0.86 $

$ \text{Weighted Entropy for Exercise} = \frac{6}{13} \times 0.92 + \frac{7}{13} \times 0.86 \approx 0.89 $

$ \text{Information Gain for Exercise} = \text{Initial Entropy} - \text{Weighted Entropy for Exercise} \approx 1 - 0.89 \approx 0.11 $

Given the information gains:

Age: 0.22
Smoker: 0.24
Exercise: 0.11

The attribute Smoker has the highest information gain and will be the root node.

For the second-level decision, let’s take the subset of data for one of the age groups, say “Age < 40”, and repeat the computation of information gain for “Smoker” and “Exercise”. We’ll select the attribute with the highest information gain to split the data further.

Dataset subset for “Smoker = Yes”

Age	Smoker	Exercise	Risk
<40	Yes	Regularly	Low
<40	Yes	Rarely	High
40-55	Yes	Regularly	High
40-55	Yes	Rarely	High
>55	Yes	Regularly	High
>55	Yes	Rarely	High

From the subset, the probabilities are:

$ p_{\text{High}} = \frac{5}{6} = 0.83 $
$ p_{\text{Low}} = \frac{1}{6} = 0.17 $

Thus, the entropy for this subset $ S $ is: $-\frac{5}{6} \log_2(\frac{5}{6}) - \frac{1}{6} \log_2(\frac{1}{6}) $ $ \approx 0.65 $

For Age:

Age <40:
- Total: 2
- High Risk: 1
- Low Risk: 1
- $ \text{Entropy} = -\frac{1}{2} \log_2(\frac{1}{2}) - \frac{1}{2} \log_2(\frac{1}{2}) = 1 $
Age 40-55:
- Total: 2
- High Risk: 2
- Low Risk: 0
- $ \text{Entropy} = 0$
Age >55:
- Total: 2
- High Risk: 2
- Low Risk: 0
- $ \text{Entropy} = 0 $

$ \text{Weighted Entropy for Age} = \frac{2}{6} \times 1 + \frac{2}{6} \times 0 + \frac{2}{6} \times 0 \approx 0.33$

$ \text{Information Gain for Age} = \text{Initial Entropy} - \text{Weighted Entropy for Age} \approx 0.65 - 0.33 \approx 0.32$

Weighted Entropy and Information Gain for Exercise

Exercise = Regularly:
- Total: 3
- High Risk: 2
- Low Risk: 1
- $ \text{Entropy} = -\frac{2}{3} \log_2(\frac{2}{3}) - \frac{1}{3} \log_2(\frac{1}{3}) = 0.92 $
Exercise = Rarely:
- Total: 3
- High Risk: 3
- Low Risk: 0
- $ \text{Entropy} = 0 $

$ \text{Weighted Entropy for Exercise} = \frac{3}{6} \times 0.92 + \frac{3}{6} \times 0 = 0.46 $ $ \text{Information Gain for Smoker} = 0.65 - 0.46 = 0.19 $

Given the information gains for the “Age > 55” group:

Age: 0.32
Exercise: 0.19

“Age” has the higher Information Gain for this group, so we will choose this as the next level node for Smoker=Yes.

Dataset subset for “Smoker = No”

Age	Smoker	Exercise	Risk
<40	No	Regularly	Low
<40	No	Rarely	Low
<40	No	Rarely	Low
40-55	No	Regularly	Low
40-55	No	Rarely	High
>55	No	Regularly	Low
>55	No	Rarely	High

From the subset, the probabilities are:

$ p_{\text{High}} = \frac{2}{7} = 0.29 $
$ p_{\text{Low}} = \frac{5}{7} = 0.71 $

Thus, the entropy for this subset $ S $ is: $-\frac{2}{7} \log_2(\frac{2}{7}) - \frac{5}{7} \log_2(\frac{5}{7}) $ $ \approx 0.86 $

For Age

Age <40:
- Total: 3
- High Risk: 0
- Low Risk: 3
- $ \text{Entropy} = 0 $
Age 40-55:
- Total: 2
- High Risk: 1
- Low Risk: 1
- $\text{Entropy} = 1$
Age >55:
- Total: 2
- High Risk: 1
- Low Risk: 1
- $\text{Entropy} = 1$

$ \text{Weighted Entropy for Age} = \frac{3}{7} \times 0 + \frac{2}{7} \times 1 + \frac{2}{7} \times 1 \approx 0.57$

$ \text{Information Gain for Age} = \text{Initial Entropy} - \text{Weighted Entropy for Age} \approx 0.86 - 0.57 \approx 0.29$

Weighted Entropy and Information Gain for Exercise

Exercise = Regularly:
- Total: 3
- High Risk: 0
- Low Risk: 3
- $ \text{Entropy} = 0 $
Exercise = Rarely:
- Total: 4
- High Risk: 2
- Low Risk: 2
- $ \text{Entropy} = 1 $

$ \text{Weighted Entropy for Exercise} = \frac{3}{7} \times 0 + \frac{3}{7} \times 1 = 0.43 $ $ \text{Information Gain for Smoker} = 0.86 - 0.43 = 0.43 $

Given the information gains for the “Smoker=No” group:

Age: 0.29
Exercise: 0.43

“Exercise” has the higher Information Gain for this group, so we will choose this as the next level node for Smoker=No.

Thus, the structure of our decision tree becomes:

Root node: Smoker
- For Smoker=Yes: Split based on Age.
- For Smoker-No: Split based on Exercise.

To determine the final classification under each branch, you would continue the process of calculating Information Gain until you reach leaves with maximum purity (i.e. all High or all Low) or until you run out of attributes to split on.

Human Activity Recognition

Table of Contents

Example: Cardiac Risk Prediction

Cardiac Risk Prediction Dataset

Initial Entropy Calculation

Information Gain Calculation

For Age

For Smoker

For Exercise

Dataset subset for “Smoker = Yes”

For Age:

Weighted Entropy and Information Gain for Exercise

Dataset subset for “Smoker = No”

For Age

Weighted Entropy and Information Gain for Exercise