Diabetes is one of the most serious global health problems, and its prevalence continues to rise worldwide. Early detection is essential to reduce complications and improve patient survival rates. Machine Learning (ML) has recently shown great potential for supporting early diabetes prediction through data-driven analysis. However, irrelevant and redundant features can degrade model performance and increase computational cost. This study therefore evaluates the effectiveness of feature selection techniques and ML algorithms for early diabetes detection using the PIMA Indians Diabetes Dataset, which consists of 768 records, 8 features, and two classes. During preprocessing, missing values and outliers were handled using mean imputation and data cleaning techniques. Three feature selection methods, Information Gain (IG), Gain Ratio (GR), and ANOVA, were applied to identify the most relevant features. Five ML algorithms, k-Nearest Neighbor (k-NN), Random Forest, Support Vector Machine (SVM), Naive Bayes, and Neural Network, were then evaluated using 10-fold cross-validation. The results showed that feature selection improved classification performance compared with using all features. Glucose, BMI, Age, and Insulin were identified as the most influential features for diabetes prediction. Among all evaluated models, Random Forest combined with ANOVA achieved the best performance, with an accuracy of 0.753. Overall, applying feature selection increased model accuracy by up to 3.82%. These findings indicate that combining effective feature selection methods with robust ML algorithms can enhance the performance of early diabetes detection systems.
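
As an illustration of the preprocessing and ANOVA-based ranking steps described above, the following is a minimal sketch in plain Python and not the study's actual implementation. The function names, the convention of treating a zero as a missing value, and the six-row toy table are all assumptions made for the example; the toy numbers are invented and are not taken from the PIMA dataset.

```python
from statistics import mean

def impute_zeros_with_mean(column):
    """Replace zeros (assumed here to mark missing values) with the mean of the rest."""
    observed = [v for v in column if v != 0]
    fill = mean(observed)
    return [fill if v == 0 else v for v in column]

def anova_f_score(values, labels):
    """One-way ANOVA F statistic of a single feature across the class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    grand_mean = mean(values)
    # Variation of the class means around the overall mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    # Variation of the samples around their own class mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def rank_features(table, labels):
    """Return feature names ordered from highest to lowest F score."""
    scores = {name: anova_f_score(col, labels) for name, col in table.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy data: two classes, one informative and one uninformative feature.
labels = [0, 0, 0, 1, 1, 1]
glucose = impute_zeros_with_mean([85, 90, 0, 150, 160, 155])
noise = [1, 2, 3, 1, 2, 3]

ranking = rank_features({"Glucose": glucose, "Noise": noise}, labels)
print(ranking)
```

A feature whose class means differ strongly relative to its within-class spread receives a high F score, so keeping only the top-ranked features is one way to perform the selection step before training a classifier. The sketch does not guard against a feature with zero within-class variance, which would cause a division by zero.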
Copyrights © 2026