Background: Diabetes is a chronic disease with increasing global prevalence, making early detection essential. Machine learning has shown strong potential for improving prediction accuracy; however, robust validation and systematic optimization are still required. Aims: This study compares machine learning methods for predicting diabetes within a reproducible and methodologically sound framework. Methods: The Pima Indian Diabetes dataset (768 samples, 8 clinical features) was used. Six algorithms were evaluated: Logistic Regression, K-Nearest Neighbors (KNN), Decision Tree, Random Forest, Support Vector Machine, and Gradient Boosting. Hyperparameters were tuned with GridSearchCV, and the models were validated using stratified 5-fold cross-validation. Performance was assessed using accuracy, precision, recall, F1-score, and AUC-ROC. Results: Ensemble methods outperformed traditional models. Random Forest achieved the highest performance, with an accuracy of 98% ± 1.8% and an AUC-ROC of 0.999 ± 0.02, followed by Gradient Boosting at 91% ± 2.1%. Logistic Regression and KNN performed worse, with accuracies of 79% ± 2.3% and 77% ± 2.5%, respectively. Feature importance analysis identified glucose level, BMI, and age as the most influential predictors. Conclusion: The study demonstrates that ensemble methods combined with hyperparameter optimization and robust validation significantly improve diabetes prediction performance and can support clinical decision-making.
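The stratified 5-fold validation named in the Methods can be illustrated with a minimal standard-library sketch. It is not the study's code: the function names `stratified_kfold` and `summarize` are hypothetical, the label vector is synthetic, and the 268 positive to 500 negative class balance is an assumption used only to make the 768-sample example concrete. In practice a library routine (for instance scikit-learn's `StratifiedKFold` together with `GridSearchCV`) would likely be used instead.

```python
import random
from collections import defaultdict
from statistics import mean, stdev


def stratified_kfold(labels, k=5, seed=0):
    """Return k (train_idx, test_idx) pairs whose test folds keep the class ratio."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    # Shuffle within each class, then deal indices round-robin across folds
    # so every fold receives a near-equal share of each class.
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, i in enumerate(idxs):
            folds[pos % k].append(i)

    all_idx = set(range(len(labels)))
    splits = []
    for test in folds:
        train = sorted(all_idx - set(test))
        splits.append((train, sorted(test)))
    return splits


def summarize(scores):
    """Mean and sample standard deviation of per-fold scores ("mean ± std")."""
    return mean(scores), stdev(scores)


# Synthetic labels: 768 samples with an ASSUMED 268 positive / 500 negative split.
labels = [1] * 268 + [0] * 500
splits = stratified_kfold(labels, k=5)

for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    pos_rate = sum(labels[i] for i in test_idx) / len(test_idx)
    print(f"fold {fold}: train={len(train_idx)} test={len(test_idx)} "
          f"positive rate={pos_rate:.3f}")
```

Each test fold keeps roughly the same proportion of positive cases as the full dataset, which is what makes per-fold scores comparable. Passing the five per-fold scores of a model to `summarize` gives a mean and standard deviation in the "± " form used in the Results.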
Copyrights © 2026