Diabetes is a non-communicable disease that is often detected at an advanced stage, which increases the risk of serious complications. Machine learning has the potential to support early diabetes detection; however, most previous studies have focused on large-scale datasets and high predictive accuracy, while methodological evaluations on small clinical datasets remain limited. This study evaluates and compares the performance of several machine learning algorithms for early diabetes prediction using a limited clinical dataset, with particular emphasis on how data characteristics affect model performance. The dataset consists of 22 samples with eight clinical features and one target variable, divided into 17 training samples and 5 testing samples. The research stages comprise data preprocessing, training–testing data splitting, model training, and performance evaluation using accuracy, precision, recall, F1-score, and ROC-AUC. The algorithms evaluated are Logistic Regression, Random Forest, Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and XGBoost. The experimental results indicate that none of the evaluated models was able to detect the diabetes class: precision, recall, and F1-score were zero for every model. Although Random Forest and XGBoost achieved an accuracy of 0.6, this value largely reflects the dominance of the non-diabetes class in the very small test set. Correlation analysis further indicates that Glucose, BMI, and Diabetes Pedigree Function are the features most strongly associated with diabetes status.
The main contribution of this study is a realistic methodological evaluation of machine learning models applied to small clinical datasets. It shows that limited sample size and the training–testing partition have a substantial impact on both model performance and the interpretation of evaluation metrics. These findings provide a methodological reference for future studies that aim to develop more reliable early diabetes prediction models under constrained clinical data conditions.
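The reported pattern, an accuracy of 0.6 alongside zero precision, recall, and F1-score, can be reproduced arithmetically on a five-sample test set. The sketch below is illustrative only: the label vectors are hypothetical (the study does not report the class counts of its test set), and `binary_metrics` is a helper name introduced here, not part of the study's code. It assumes a test set of three non-diabetes and two diabetes samples and a model that predicts the majority class for every sample; undefined ratios (zero denominators) are reported as 0.0 by convention.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Report 0.0 when a ratio is undefined (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical test set (assumption): 3 non-diabetes (0), 2 diabetes (1).
y_true = [0, 0, 0, 1, 1]
# A model that always predicts the majority (non-diabetes) class.
y_pred = [0, 0, 0, 0, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Under these assumed labels, three of five predictions are correct, so accuracy is 0.6 even though no diabetes case is identified. With a test set this small, a single sample changes accuracy by 0.2, which is why accuracy alone is a poor indicator here.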
Copyrights © 2025