Stroke is a leading cause of death and disability worldwide. This study proposes a stroke risk classification model based on ensemble learning that combines the Random Forest and XGBoost algorithms. A Kaggle dataset of 5110 samples (249 stroke, 4861 non-stroke) exhibited severe class imbalance. To address this, a comprehensive preprocessing pipeline was implemented, comprising feature encoding, feature scaling, feature selection with the ANOVA F-test, outlier handling with the Z-score and IQR methods, and missing-value imputation with MICE. SMOTE-ENN was then applied to rebalance the class distribution. The dataset was split into 80% training and 20% testing data (hold-out test set) to ensure objective evaluation. Hyperparameters were tuned with Bayesian optimization, and models were evaluated with stratified K-fold cross-validation to guard against overfitting. On the hold-out test set, the ensemble achieved an AUC of 0.99 with 98% accuracy, 98% precision, and 98% recall. Feature importance analysis identified average glucose level and age as the strongest stroke risk predictors. The proposed approach substantially improved predictive accuracy over previous research, demonstrating the effectiveness of ensemble learning combined with careful preprocessing for building reliable, high-performing models for early stroke risk assessment.
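The pipeline described above can be sketched in scikit-learn. This is a minimal illustration under several assumptions: synthetic imbalanced data stands in for the Kaggle stroke dataset, sklearn's `GradientBoostingClassifier` stands in for XGBoost, and the SMOTE-ENN resampling, MICE imputation, outlier handling, and Bayesian hyperparameter search steps are omitted to keep the example dependency-light; it shows only the scaling, ANOVA F-test selection, soft-voting ensemble, stratified hold-out split, and stratified K-fold evaluation.

```python
# Sketch of the abstract's pipeline (assumptions: synthetic data replaces
# the Kaggle stroke dataset; GradientBoostingClassifier replaces XGBoost;
# SMOTE-ENN, MICE, outlier handling, and Bayesian tuning are omitted).
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Imbalanced binary problem, roughly mimicking the 249 : 4861 stroke ratio.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.95, 0.05], random_state=42)

# 80% training / 20% hold-out test split, stratified by class.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=42)

# Preprocessing (scaling + ANOVA F-test feature selection) feeding a
# soft-voting ensemble of Random Forest and gradient boosting.
model = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=5)),
    ("ensemble", VotingClassifier(
        estimators=[("rf", RandomForestClassifier(n_estimators=100,
                                                  random_state=42)),
                    ("gb", GradientBoostingClassifier(random_state=42))],
        voting="soft")),
])

# Stratified K-fold cross-validation on the training split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_auc = cross_val_score(model, X_tr, y_tr, cv=cv, scoring="roc_auc")

# Final fit and objective evaluation on the held-out 20%.
model.fit(X_tr, y_tr)
holdout_auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(f"CV AUC: {cv_auc.mean():.3f}  hold-out AUC: {holdout_auc:.3f}")
```

In practice, `SMOTEENN` from the imbalanced-learn package would be applied to the training split only (never the hold-out set) before fitting, and the `k`, tree-count, and boosting parameters would come from the Bayesian search rather than defaults.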
Copyright © 2025