This notebook demonstrates the evaluation of a classifier for detecting sleep stages: Wakefulness, REM (Rapid Eye Movement), and NREM (Non-Rapid Eye Movement).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.signal import welch
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support, accuracy_score, balanced_accuracy_score
# Set random seed for reproducibility
np.random.seed(42)
We'll create synthetic data to simulate accelerometer, heart rate, and heart rate variability (HRV) measurements for different sleep stages. The function below generates data with the following characteristics:

- Wakefulness: the largest accelerometer perturbations (σ = 1.2) and the biggest heart-rate elevation (about +10 bpm on average)
- REM: intermediate movement (σ = 0.9) and a moderate heart-rate elevation (about +5 bpm)
- NREM: the smallest movement (σ = 0.8) and a nearly unchanged heart rate (about +1 bpm)

Additional Gaussian noise is then added to every signal so the classes overlap rather than being trivially separable.
We then extract a set of summary features from the accelerometer and heart-rate signals (e.g. min, max, mean, standard deviation, median), along with specialized HRV features from the heart rate (e.g. RMSSD, SDNN, pNN50), and obtain the dominant frequency of the accelerometer magnitude using SciPy's `welch` function.
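For reference, the standard definitions of the first two HRV statistics, applied in the code below to the raw heart-rate samples $x_1, \dots, x_N$ of a window (a simplification: real HRV analysis uses NN intervals in milliseconds, which is also why `hrv_sdnn` comes out identical to `hr_std` here), are:

$$\mathrm{RMSSD} = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N-1}\left(x_{i+1}-x_i\right)^2}, \qquad \mathrm{SDNN} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i-\bar{x}\right)^2}$$

pNN50 is the percentage of successive differences whose absolute value exceeds 50 (milliseconds in the standard definition; the code applies the same threshold to bpm differences).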
def generate_synthetic_data(n_samples=1000, window_size=30, class_distribution=[0.5, 0.3, 0.2]):
    # Generate base signals
    time = np.arange(n_samples * window_size) / 100  # Assuming 100 Hz sampling rate (kept for reference; not used below)
    accel_x = np.random.randn(n_samples * window_size)
    accel_y = np.random.randn(n_samples * window_size)
    accel_z = np.random.randn(n_samples * window_size)
    heart_rate = np.random.normal(60, 10, n_samples * window_size)

    # Generate labels
    labels = np.random.choice(['Wakefulness', 'REM', 'NREM'], n_samples, p=class_distribution)
    labels = np.repeat(labels, window_size)

    # Adjust signals based on sleep stage (with more overlap and noise)
    for i, label in enumerate(labels):
        if label == 'Wakefulness':
            accel_x[i] += np.random.normal(0, 1.2)
            accel_y[i] += np.random.normal(0, 1.2)
            accel_z[i] += np.random.normal(0, 1.2)
            heart_rate[i] += np.random.normal(10, 5)
        elif label == 'REM':
            accel_x[i] += np.random.normal(0, 0.9)
            accel_y[i] += np.random.normal(0, 0.9)
            accel_z[i] += np.random.normal(0, 0.9)
            heart_rate[i] += np.random.normal(5, 3)
        else:  # NREM
            accel_x[i] += np.random.normal(0, 0.8)
            accel_y[i] += np.random.normal(0, 0.8)
            accel_z[i] += np.random.normal(0, 0.8)
            heart_rate[i] += np.random.normal(1, 2)

    # Add random noise to all signals
    accel_x += np.random.normal(0, 1, len(accel_x))
    accel_y += np.random.normal(0, 1, len(accel_y))
    accel_z += np.random.normal(0, 1, len(accel_z))
    heart_rate += np.random.normal(0, 5, len(heart_rate))

    # Calculate features for each window
    features = []
    for i in range(0, len(accel_x), window_size):
        window_x = accel_x[i:i+window_size]
        window_y = accel_y[i:i+window_size]
        window_z = accel_z[i:i+window_size]
        window_hr = heart_rate[i:i+window_size]

        # Accelerometer features: summary statistics of the magnitude,
        # plus the dominant frequency from Welch's power spectral density estimate
        accel_magnitude = np.sqrt(window_x**2 + window_y**2 + window_z**2)
        f, Pxx = welch(accel_magnitude, fs=100, nperseg=window_size)
        dominant_freq = f[np.argmax(Pxx)]
        accel_features = {
            'accel_mean': np.mean(accel_magnitude),
            'accel_std': np.std(accel_magnitude),
            'accel_max': np.max(accel_magnitude),
            'accel_min': np.min(accel_magnitude),
            'accel_median': np.median(accel_magnitude),
            'accel_dominant_freq': dominant_freq
        }

        # Heart rate and HRV features (computed on bpm samples as a
        # simplification; real HRV analysis uses NN intervals in milliseconds)
        hr_diff = np.diff(window_hr)
        hrv_features = {
            'hr_mean': np.mean(window_hr),
            'hr_std': np.std(window_hr),
            'hr_max': np.max(window_hr),
            'hr_min': np.min(window_hr),
            'hrv_rmssd': np.sqrt(np.mean(hr_diff**2)),
            'hrv_sdnn': np.std(window_hr),  # identical to hr_std for this synthetic signal
            'hrv_pnn50': np.sum(np.abs(hr_diff) > 50) / len(hr_diff) * 100
        }

        features.append({**accel_features, **hrv_features, 'Stage': labels[i]})

    return pd.DataFrame(features)
# Generate data
data = generate_synthetic_data(2000)
data.head()
| | accel_mean | accel_std | accel_max | accel_min | accel_median | accel_dominant_freq | hr_mean | hr_std | hr_max | hr_min | hrv_rmssd | hrv_sdnn | hrv_pnn50 | Stage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.160171 | 1.369994 | 5.736376 | 0.770211 | 3.218683 | 33.333333 | 71.754275 | 12.820519 | 103.847061 | 54.711142 | 18.543974 | 12.820519 | 0.000000 | Wakefulness |
| 1 | 2.757489 | 1.187661 | 5.774216 | 0.810158 | 2.668205 | 6.666667 | 70.222542 | 11.530130 | 93.633252 | 43.950718 | 14.828998 | 11.530130 | 0.000000 | Wakefulness |
| 2 | 2.583652 | 0.960001 | 4.712651 | 0.857776 | 2.600193 | 30.000000 | 57.935638 | 7.986526 | 77.874840 | 36.572344 | 10.991775 | 7.986526 | 0.000000 | NREM |
| 3 | 2.832752 | 0.983541 | 4.633715 | 0.836868 | 2.717458 | 23.333333 | 66.557528 | 14.364074 | 91.209212 | 36.122140 | 20.304962 | 14.364074 | 0.000000 | REM |
| 4 | 2.812697 | 1.054019 | 5.032076 | 0.949556 | 2.848692 | 36.666667 | 70.133988 | 12.177066 | 113.358585 | 43.797419 | 19.262584 | 12.177066 | 6.896552 | Wakefulness |
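As a quick sanity check (not part of the original pipeline), we can confirm that the generated class mix is close to the requested class_distribution of [0.5, 0.3, 0.2]:

# Inspect the proportion of each stage in the generated data
print(data['Stage'].value_counts(normalize=True))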
After generating and preprocessing our synthetic sleep stage data, we now move on to training our model and evaluating its performance. This section covers the following steps:

1. Data Splitting: We separate our features (X) and labels (y), then split them into training and testing sets.
2. Model Training: We train a Decision Tree classifier on the training data.
3. Prediction: We use our trained model to make predictions on the test set.
4. Evaluation: We compute the overall accuracy, the confusion matrix, and per-class precision, recall, and F1 scores.
5. Visualization: We create a heatmap of the confusion matrix, providing an intuitive representation of our model's performance.

These steps allow us to assess how well our model distinguishes between Wakefulness, REM, and NREM sleep stages based on the synthetic accelerometer and heart rate data we generated.

The confusion matrix will show us how the predictions are distributed across the three stages: each row corresponds to the actual stage and each column to the predicted stage, so the diagonal counts correct classifications while off-diagonal entries reveal which stages get confused with one another.
# Split data into features (X) and labels (y)
X = data.drop('Stage', axis=1)
y = data['Stage']
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train a Decision Tree classifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
# Make predictions
y_pred = clf.predict(X_test)
# Generate confusion matrix
cm = confusion_matrix(y_test, y_pred, labels=['Wakefulness', 'REM', 'NREM'])
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
# Calculate precision, recall, and F1 score
precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred, average=None, labels=['Wakefulness', 'REM', 'NREM'])
# Display results
print(f"\nOverall Accuracy: {accuracy:.3f}")
# Print the confusion matrix
print("Confusion Matrix:")
print(cm)
# Visualize the confusion matrix
plt.figure(figsize=(5, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Wakefulness', 'REM', 'NREM'],
            yticklabels=['Wakefulness', 'REM', 'NREM'])
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
Overall Accuracy: 0.773

Confusion Matrix:
[[280  37   0]
 [ 33  98  29]
 [  2  35  86]]
The precision, recall, and F1 scores provide additional insight into the model's performance for each sleep stage. Let's display them as a formatted table using tabulate.
from tabulate import tabulate
# Create a DataFrame for precision, recall, and F1 score
results_df = pd.DataFrame({
    'Precision': precision,
    'Recall': recall,
    'F1 Score': f1
}, index=['Wakefulness', 'REM', 'NREM'])
# Format the DataFrame as a pretty table
table = tabulate(results_df, headers='keys', tablefmt='fancy_grid', floatfmt='.3f')
# Display results as a table
print("\nPrecision, Recall, and F1 Score:")
print(table)
Precision, Recall, and F1 Score:
╒═════════════╤═════════════╤══════════╤════════════╕
│             │   Precision │   Recall │   F1 Score │
╞═════════════╪═════════════╪══════════╪════════════╡
│ Wakefulness │       0.889 │    0.883 │      0.886 │
├─────────────┼─────────────┼──────────┼────────────┤
│ REM         │       0.576 │    0.613 │      0.594 │
├─────────────┼─────────────┼──────────┼────────────┤
│ NREM        │       0.748 │    0.699 │      0.723 │
╘═════════════╧═════════════╧══════════╧════════════╛
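These per-class numbers follow directly from the confusion matrix above: for Wakefulness, recall is the first row's diagonal entry divided by the row sum, 280 / (280 + 37 + 0) ≈ 0.883, and precision is that same entry divided by the first column's sum, 280 / (280 + 33 + 2) ≈ 0.889, matching the table.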
After training our Decision Tree classifier and evaluating its performance, we now turn our attention to understanding which features contribute most significantly to the model's decision-making process. This analysis is crucial for several reasons:
- Model Interpretability: It helps us understand which aspects of the accelerometer and heart rate data are most influential in distinguishing between sleep stages.
- Feature Selection: Identifying the most important features can guide future data collection efforts and potentially simplify the model.
We'll use the Decision Tree's built-in feature importance metric (mean decrease in impurity), which measures how much each feature reduces the splitting criterion (Gini impurity by default for DecisionTreeClassifier), weighted by the number of samples that reach the nodes where the feature is used.
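Concretely, a sketch of the standard mean-decrease-in-impurity formulation that scikit-learn reports for feature $j$ is:

$$\mathrm{MDI}(j) \;=\; \sum_{t\,:\,v(t)=j} \frac{N_t}{N}\left[\,I(t) - \frac{N_{t_L}}{N_t}\,I(t_L) - \frac{N_{t_R}}{N_t}\,I(t_R)\right]$$

where the sum runs over the tree nodes $t$ that split on feature $j$, $N_t$ is the number of training samples reaching node $t$, $t_L$ and $t_R$ are its children, and $I(\cdot)$ is the node impurity; the resulting values are normalized to sum to 1.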
The following code will:

1. Extract the feature importances from the trained classifier.
2. Sort them for plotting.
3. Display them as a horizontal bar chart.
4. Print the exact importance values in descending order.

The horizontal bar chart provides a visual representation of the relative importance of each feature, while the printed values give us precise numerical data.
# Feature importance
importances = clf.feature_importances_
feature_names = X.columns
# Sort feature importances in descending order
sorted_idx = np.argsort(importances)
sorted_importances = importances[sorted_idx]
sorted_feature_names = feature_names[sorted_idx]
# Plot feature importances as a horizontal bar graph
plt.figure(figsize=(8, 4))
y_pos = np.arange(len(sorted_feature_names))
plt.barh(y_pos, sorted_importances)
plt.yticks(y_pos, sorted_feature_names)
plt.xlabel('Importance')
plt.title('Feature Importance')
plt.tight_layout()
plt.show()
# Print feature importances
print("\nFeature Importances:")
for name, importance in sorted(zip(feature_names, importances), key=lambda x: x[1], reverse=True):
print("--"+f"{name}: {importance:.4f}")
Feature Importances:
--hr_mean: 0.6134
--accel_mean: 0.0720
--hr_max: 0.0443
--accel_std: 0.0435
--hrv_rmssd: 0.0423
--hr_min: 0.0415
--accel_dominant_freq: 0.0270
--accel_min: 0.0265
--hrv_sdnn: 0.0261
--accel_median: 0.0259
--accel_max: 0.0231
--hr_std: 0.0145
--hrv_pnn50: 0.0000
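Impurity-based importances are computed from the training data and can be biased toward features with many distinct values. As a complementary check (not part of the original analysis), we could also compute permutation importance on the held-out test set using scikit-learn:

from sklearn.inspection import permutation_importance

# Shuffle each feature in turn on the test set and measure the drop in accuracy
perm = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=42)
for idx in perm.importances_mean.argsort()[::-1]:
    print(f"{feature_names[idx]}: {perm.importances_mean[idx]:.4f} +/- {perm.importances_std[idx]:.4f}")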
In many real-world scenarios, we encounter datasets where the distribution of classes is not equal. This is known as class imbalance. For example, in medical diagnosis, the number of healthy patients might vastly outnumber those with a rare condition. In our sleep stage classification, we're simulating a scenario where "Wakefulness" is much more common than "REM" or "NREM" stages.
When dealing with imbalanced datasets, overall accuracy can be a misleading metric. Here's why:

- Majority Class Bias: A model can achieve high accuracy simply by always predicting the majority class. For instance, if 95% of our data is "Wakefulness", a model that always predicts "Wakefulness" would be 95% accurate, despite being utterly useless for identifying the other sleep stages.
- Overlooking Minority Classes: Accuracy gives equal weight to all predictions. In an imbalanced dataset, the performance on minority classes has little impact on the overall accuracy. A model could perform poorly on important but rare classes while maintaining high overall accuracy.
- False Sense of Performance: High accuracy might lead us to believe our model is performing well across all classes, when in reality it might be failing to identify critical minority cases.
Balanced accuracy addresses these issues by giving equal weight to each class, regardless of its frequency in the dataset. Here's how it works:

- Definition: Balanced accuracy is the average of the recall obtained on each class.
- Calculation: For each class, calculate the proportion of correct predictions (recall). Then take the average of these recall values, as in the sketch after this list.
- Interpretation: A balanced accuracy of 0.5 for binary classification (or 1/n for n classes) represents the performance of random guessing. Any value above this indicates better-than-random performance across all classes.
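Here is a minimal sketch of that calculation on hypothetical labels (the counts below are made up for illustration; scikit-learn is used only to cross-check the by-hand average):

from sklearn.metrics import balanced_accuracy_score, recall_score

# Hypothetical imbalanced test set: the model nails the majority class but misses every REM example
y_true_demo = ['Wakefulness'] * 95 + ['REM'] * 3 + ['NREM'] * 2
y_pred_demo = ['Wakefulness'] * 98 + ['NREM'] * 2

# Per-class recall: Wakefulness = 95/95, REM = 0/3, NREM = 2/2
recalls = recall_score(y_true_demo, y_pred_demo, labels=['Wakefulness', 'REM', 'NREM'], average=None)
print(recalls.mean())                                     # (1.0 + 0.0 + 1.0) / 3 ≈ 0.667
print(balanced_accuracy_score(y_true_demo, y_pred_demo))  # same value
# Plain accuracy would be 97/100 = 0.97, hiding the complete failure on REM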
In the following example, we'll compare regular accuracy with balanced accuracy for both balanced and imbalanced (skewed) datasets. This comparison will highlight how balanced accuracy provides a more realistic assessment of our model's performance, especially when dealing with class imbalance.
print("\n--- Skewed Data Case ---\n")
# Generate skewed data
skewed_data = generate_synthetic_data(2000, class_distribution=[0.95, 0.03, 0.02])
# Split data into features (X) and labels (y)
X_skewed = skewed_data.drop('Stage', axis=1)
y_skewed = skewed_data['Stage']
# Split into training and testing sets
X_train_skewed, X_test_skewed, y_train_skewed, y_test_skewed = train_test_split(X_skewed, y_skewed, test_size=0.3, random_state=42)
# Train a Decision Tree classifier
clf_skewed = DecisionTreeClassifier(random_state=42)
clf_skewed.fit(X_train_skewed, y_train_skewed)
# Make predictions
y_pred_skewed = clf_skewed.predict(X_test_skewed)
# Generate confusion matrix
cm_skewed = confusion_matrix(y_test_skewed, y_pred_skewed, labels=['Wakefulness', 'REM', 'NREM'])
# Calculate precision, recall, and F1 score
precision_skewed, recall_skewed, f1_skewed, _ = precision_recall_fscore_support(y_test_skewed, y_pred_skewed, average=None, labels=['Wakefulness', 'REM', 'NREM'])
# Display results
print("Confusion Matrix (Skewed Data):")
print(cm_skewed)
# Visualize the confusion matrix
plt.figure(figsize=(5, 4))
sns.heatmap(cm_skewed, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Wakefulness', 'REM', 'NREM'],
            yticklabels=['Wakefulness', 'REM', 'NREM'])
plt.title('Confusion Matrix (Skewed Data)')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
# Create a DataFrame for precision, recall, and F1 score
results_df_skewed = pd.DataFrame({
    'Precision': precision_skewed,
    'Recall': recall_skewed,
    'F1 Score': f1_skewed
}, index=['Wakefulness', 'REM', 'NREM'])
# Format the DataFrame as a pretty table
table_skewed = tabulate(results_df_skewed, headers='keys', tablefmt='fancy_grid', floatfmt='.3f')
# Display results as a table
print("\nPrecision, Recall, and F1 Score (Skewed Data):")
print(table_skewed)
# Calculate and display class distribution
class_distribution_skewed = y_test_skewed.value_counts(normalize=True)
print("\nClass Distribution in Test Set (Skewed Data):")
print(class_distribution_skewed)
# Naive classifier that always predicts the majority class
naive_pred_skewed = np.full_like(y_test_skewed, 'Wakefulness')
naive_accuracy_skewed = accuracy_score(y_test_skewed, naive_pred_skewed)
print(f"\nNaive Classifier Accuracy (always predicting Wakefulness): {naive_accuracy_skewed:.3f}")
# Calculate overall accuracy for the skewed case
accuracy_skewed = accuracy_score(y_test_skewed, y_pred_skewed)
print(f"\nOverall Accuracy (Skewed Data): {accuracy_skewed:.3f}")
# Calculate balanced accuracy for the skewed case
balanced_accuracy_skewed = balanced_accuracy_score(y_test_skewed, y_pred_skewed)
print(f"Balanced Accuracy (Skewed Data): {balanced_accuracy_skewed:.3f}")
# Balanced accuracy for the naive classifier
naive_balanced_accuracy_skewed = balanced_accuracy_score(y_test_skewed, naive_pred_skewed)
print(f"Naive Classifier Balanced Accuracy: {naive_balanced_accuracy_skewed:.3f}")
--- Skewed Data Case ---

Confusion Matrix (Skewed Data):
[[559  11   2]
 [  8   5   5]
 [  0   1   9]]

Precision, Recall, and F1 Score (Skewed Data):
╒═════════════╤═════════════╤══════════╤════════════╕
│             │   Precision │   Recall │   F1 Score │
╞═════════════╪═════════════╪══════════╪════════════╡
│ Wakefulness │       0.986 │    0.977 │      0.982 │
├─────────────┼─────────────┼──────────┼────────────┤
│ REM         │       0.294 │    0.278 │      0.286 │
├─────────────┼─────────────┼──────────┼────────────┤
│ NREM        │       0.562 │    0.900 │      0.692 │
╘═════════════╧═════════════╧══════════╧════════════╛

Class Distribution in Test Set (Skewed Data):
Stage
Wakefulness    0.953333
REM            0.030000
NREM           0.016667
Name: proportion, dtype: float64

Naive Classifier Accuracy (always predicting Wakefulness): 0.953

Overall Accuracy (Skewed Data): 0.955
Balanced Accuracy (Skewed Data): 0.718
Naive Classifier Balanced Accuracy: 0.333
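Reading these numbers together makes the point of this section concrete: overall accuracy (0.955) barely beats the naive always-Wakefulness baseline (0.953). Balanced accuracy, the mean of the three per-class recalls, (0.977 + 0.278 + 0.900) / 3 ≈ 0.718, exposes how poorly the model handles the minority REM stage, and it drops to 0.333 (chance level for three classes) for the naive baseline.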