Enhancing Early Diabetes Detection Using Tree-Based Machine Learning Algorithms with SMOTEENN Balancing
Lonang, Syahrani; Putra, Ahmad Fatoni Dwi; Firdaus, Asno Azzawagama; Syuhada, Fahmi; Sa'adati, Yuan
Mobile and Forensics Vol. 8 No. 1 (2026)
Publisher : Universitas Ahmad Dahlan

DOI: 10.12928/mf.v8i1.14495

Abstract

Diabetes continues to be a critical global health issue, demanding accurate predictive systems to enable preventive interventions. Traditional diagnostic tests are inefficient for large-scale early screening, which has led to growing interest in artificial intelligence solutions. This research proposed an effective methodology for diabetes classification based on tree-based algorithms enhanced with SMOTEENN balancing. The study employed the Kaggle Diabetes Prediction Dataset with 100,000 instances and eight medical and demographic features. Preprocessing steps included handling missing and duplicate values, encoding categorical variables, and scaling numerical attributes with Min-Max normalization. To address severe class imbalance, SMOTEENN was adopted, producing a cleaner and more balanced dataset. Model evaluation was performed using Stratified 5-Fold cross-validation on six classifiers: Decision Tree, Random Forest, Gradient Boosting, AdaBoost, XGBoost, and CatBoost. Experimental results indicated significant gains after balancing, with ensemble methods outperforming the single-tree baseline. Random Forest delivered the best overall performance (98.93% accuracy, 98.96% F1-score, 99.16% recall, 99.94% AUC), followed by CatBoost and XGBoost with comparable results above 99% AUC. While Decision Tree benefited most from SMOTEENN in relative terms, it remained less competitive. Feature importance analysis revealed HbA1c level and blood glucose level as the dominant predictors, indicating that the models learned clinically meaningful patterns. These findings suggest that integrating hybrid resampling with ensemble tree classifiers provides reliable and generalizable predictions of diabetes risk. The approach holds promise for deployment in healthcare decision support systems.
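The preprocessing steps named in the abstract (Min-Max normalization, SMOTEENN balancing, stratified 5-fold splitting) can be illustrated with a minimal sketch. This is not the authors' code: it uses only the Python standard library, the sixteen data rows are made-up (HbA1c, blood glucose) values, and the helper names `min_max_scale`, `nearest`, `smote`, `enn`, and `stratified_folds` are hypothetical. The cleaning step here uses a majority-vote rule over three neighbours, which is an assumption; library implementations may apply a different rule and different defaults.

```python
import math
import random


def min_max_scale(rows):
    """Rescale every column to [0, 1]; a constant column maps to 0.0."""
    cols = list(zip(*rows))
    lo, hi = [min(c) for c in cols], [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in rows]


def nearest(i, X, candidates, k):
    """Indices of the k candidates closest (Euclidean) to X[i], excluding i."""
    others = sorted((j for j in candidates if j != i),
                    key=lambda j: math.dist(X[i], X[j]))
    return others[:k]


def smote(X, y, minority, k=3, seed=0):
    """SMOTE step for two classes: add synthetic minority points until counts match."""
    rng = random.Random(seed)
    min_idx = [i for i, label in enumerate(y) if label == minority]
    n_needed = (len(y) - len(min_idx)) - len(min_idx)
    X_new, y_new = [list(r) for r in X], list(y)
    for _ in range(max(n_needed, 0)):
        i = rng.choice(min_idx)                    # a real minority sample
        j = rng.choice(nearest(i, X, min_idx, k))  # one of its minority neighbours
        gap = rng.random()                         # position on the segment i -> j
        X_new.append([a + gap * (b - a) for a, b in zip(X[i], X[j])])
        y_new.append(minority)
    return X_new, y_new


def enn(X, y, k=3):
    """ENN step: keep a sample only if most of its k neighbours share its label."""
    keep = []
    for i in range(len(X)):
        votes = [y[j] for j in nearest(i, X, range(len(X)), k)]
        if votes.count(y[i]) * 2 > len(votes):
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def stratified_folds(y, n_folds=5, seed=0):
    """Deal each class's shuffled indices round-robin so folds keep class ratios."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for label in sorted(set(y)):
        idx = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)
    return folds


# Made-up (HbA1c, blood glucose) rows: 12 negatives, 4 positives.
X_raw = [
    [5.0, 85], [5.2, 90], [5.4, 100], [5.7, 126], [4.8, 80], [6.0, 130],
    [5.5, 140], [5.8, 145], [6.1, 155], [5.0, 100], [4.5, 90], [6.2, 158],
    [6.6, 160], [7.0, 200], [8.2, 260], [9.0, 300],
]
y = [0] * 12 + [1] * 4

X_scaled = min_max_scale(X_raw)
X_bal, y_bal = smote(X_scaled, y, minority=1)
X_clean, y_clean = enn(X_bal, y_bal)
folds = stratified_folds(y_clean)

print("before:", y.count(0), y.count(1))
print("after SMOTE:", y_bal.count(0), y_bal.count(1))
print("after ENN:", y_clean.count(0), y_clean.count(1))
```

In practice the same pipeline would more likely be assembled from scikit-learn and imbalanced-learn components, with resampling applied only to the training portion of each fold so that synthetic samples do not leak into the evaluation data. Whether the study resampled before or inside cross-validation is not stated in the abstract.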