Healthcare has benefited greatly from the rapid development of artificial intelligence, especially machine learning (ML). Imbalanced data is a significant problem in medical classification because it can impair model performance, particularly in identifying important minority classes such as patients with a particular disease. This study evaluates how well two ensemble-based algorithms, Random Forest and Gradient Boosting, handle class imbalance in diabetes prediction. The dataset, obtained from Kaggle, includes demographic and medical variables such as age, body mass index, smoking history, HbA1c level, and blood glucose level. The methodology comprises data preprocessing, train-test splitting, model implementation with default parameters, and hyperparameter tuning using Grid Search with cross-validation. Model performance was assessed using accuracy, precision, recall, F1-score, and AUC-ROC. Both models achieved accuracy above 97%. After tuning, Random Forest reached 97.06% accuracy, an AUC of 0.974, and a positive-class precision of 0.99; however, its recall declined slightly, which could lead to underdiagnosis. Gradient Boosting, by contrast, showed consistent performance, with an AUC of 0.9791 and an F1-score of 0.81. These results indicate that hyperparameter tuning can improve model performance, but algorithm selection should be guided by the needs of the application, especially in medical settings where balancing sensitivity against diagnostic precision is crucial.
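The trade-off between accuracy, precision, and recall under class imbalance can be illustrated with a minimal sketch. The confusion-matrix counts below are hypothetical values chosen for illustration only; they are not taken from the study's dataset or results, and the names `Confusion` and `metrics` are assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Confusion-matrix counts for the positive (diabetic) class."""
    tp: int  # diabetic patients correctly flagged
    fp: int  # healthy patients wrongly flagged
    fn: int  # diabetic patients missed (underdiagnosis)
    tn: int  # healthy patients correctly cleared


def metrics(c: Confusion) -> dict:
    """Compute accuracy, precision, recall, and F1 from raw counts."""
    total = c.tp + c.fp + c.fn + c.tn
    accuracy = (c.tp + c.tn) / total
    # Guard against division by zero when no positives are predicted or present.
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical cohort: 1000 patients, 85 of them diabetic (illustrative only).
always_negative = Confusion(tp=0, fp=0, fn=85, tn=915)  # predicts "healthy" for everyone
cautious = Confusion(tp=58, fp=1, fn=27, tn=914)        # rarely wrong when it flags, but misses cases

for name, c in [("always_negative", always_negative), ("cautious", cautious)]:
    rounded = {k: round(v, 3) for k, v in metrics(c).items()}
    print(name, rounded)
```

In this hypothetical cohort, a classifier that never predicts diabetes still scores over 91% accuracy while detecting no patients at all, and the second classifier pairs very high precision with markedly lower recall. This is why accuracy alone is a weak guide on imbalanced medical data and why recall and F1-score need to be read alongside it.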
Copyright © 2025