Polycystic Ovary Syndrome (PCOS) is a complex endocrine disorder affecting women of reproductive age. Early diagnosis is difficult because clinical presentations are heterogeneous, and automated classification is further hampered by class-imbalanced clinical datasets. This study develops a data leakage–free machine learning pipeline to improve the accuracy and reliability of PCOS classification from clinical data. The dataset was preprocessed and normalized, then divided by a stratified 80:20 train-test split to preserve class proportions. The proposed pipeline was implemented within a unified computational framework that integrates feature selection based on the ANOVA F-test, class imbalance handling with the Synthetic Minority Over-sampling Technique (SMOTE), and classification with a Support Vector Machine (SVM) using a Radial Basis Function (RBF) kernel. Hyperparameters were tuned with GridSearchCV combined with K-Fold Cross-Validation to support model robustness and consistency. The proposed model achieved an accuracy of 0.9074, with precision, recall, and F1-score of 0.8378, 0.8857, and 0.8611, respectively. Ten dominant clinical features were identified, primarily related to hormonal profiles and ovarian morphology. These results indicate that the data leakage–free pipeline supports valid and stable PCOS prediction. The approach may serve as a supportive tool for clinical decision-making, particularly for early and objective identification of PCOS.
Copyrights © 2026