This research develops a classification model for early detection of diabetes risk in the productive age group (18–44 years) using a machine learning approach. Following the CRISP-DM methodology, the study used the Diabetes Health Indicators Dataset from the CDC BRFSS 2015 survey, refined to 48,867 observations. Class imbalance (4.51% diabetes-positive) was addressed with the Synthetic Minority Over-sampling Technique (SMOTE), which brought the training set to a 1:1 class ratio. Elbow curve analysis and mutual information identified 10 features that balance model performance and system usability. Three algorithms (Logistic Regression, Random Forest, and XGBoost) were evaluated and validated using Stratified 5-Fold Cross-Validation. Logistic Regression achieved the best performance for health screening purposes, with a recall of 75.06% and a ROC-AUC of 83.62%; it detects roughly three out of four diabetes cases, and its cross-validated recall is consistent (75.02% ± 2.35%). Of the three models evaluated, it was the most effective early screening tool for diabetes risk, supporting early detection and medical intervention in the productive age population.
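The oversampling step mentioned above can be illustrated with a minimal sketch of the core SMOTE idea: each synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its k nearest minority neighbours. This is not the study's implementation. The function name `smote_oversample`, the toy data, and the choice of `k` are assumptions made for illustration, and plain interpolation treats every feature as continuous, which may not suit binary or ordinal survey indicators.

```python
import math
import random


def smote_oversample(minority, n_majority, k=5, seed=0):
    """Return synthetic minority rows so that the classes reach a 1:1 ratio.

    Hypothetical helper for illustration only; it assumes all features
    are numeric and can be interpolated.
    """
    rng = random.Random(seed)
    n_needed = n_majority - len(minority)
    if n_needed <= 0 or len(minority) < 2:
        return []
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_needed):
        # Pick a random minority sample as the starting point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find its k nearest minority neighbours (Euclidean distance).
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        # Interpolate at a random position between the two samples.
        gap = rng.random()
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy example: 4 minority rows, 10 majority rows (values are made up).
minority = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.4], [0.9, 2.1]]
new_rows = smote_oversample(minority, n_majority=10, k=2)
print(len(minority) + len(new_rows))  # 10, matching the majority count
```

As in the abstract, oversampling of this kind would be applied to the training set only, so that validation and test folds keep the original class distribution.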
Copyrights © 2026