Hardiyanti P, Cicin
Unknown Affiliation

Optimizing breast cancer classification using SMOTE, Boruta, and XGBoost
Hardiyanti P, Cicin
Science in Information Technology Letters Vol 6, No 1 (2025): May 2025
Publisher : Association for Scientific Computing Electronics and Engineering (ASCEE)

DOI: 10.31763/sitech.v6i1.2109

Abstract

Breast cancer remains one of the leading causes of death among women worldwide. This study develops a breast cancer classification framework based on clinical data that integrates the Synthetic Minority Oversampling Technique (SMOTE), the Boruta feature selection algorithm, and the XGBoost classifier. The approach is evaluated on the Wisconsin Breast Cancer Diagnostic (WBCD) dataset, which consists of 569 samples and 30 numerical features. SMOTE addresses class imbalance, Boruta selects the most relevant diagnostic features, and XGBoost serves as the main classifier because of its robustness on tabular and imbalanced data. The model is validated using Repeated Stratified K-Fold Cross-Validation with 30 repetitions to ensure statistically stable estimates. The resulting model achieves an average accuracy of 0.9608 ± 0.0274, precision of 0.9465 ± 0.0481, recall of 0.9512 ± 0.0524, and F1-score of 0.9475 ± 0.0374. The ROC-AUC reaches 0.9926 ± 0.0094, the PR-AUC is 0.9906 ± 0.0113, and the Matthews Correlation Coefficient (MCC) is 0.9179 ± 0.0575, indicating balanced performance across both classes. Clinically, the model can aid early diagnosis: the framework discards irrelevant diagnostic attributes and retains only 10 key features without compromising accuracy, yielding a lightweight yet reliable diagnostic tool. Limitations include the relatively small dataset and the absence of hyperparameter tuning. Future research should explore larger datasets, advanced ensemble methods, and interpretability techniques such as SHAP or LIME to improve clinical transparency and adoption.
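
To make two of the building blocks named in the abstract concrete, the following is a minimal, standard-library-only sketch of SMOTE-style interpolation and of the MCC and "mean ± std" reporting. It is not the authors' implementation. The function names (`smote`, `mcc`, `mean_std`), the neighbour count `k=5`, the random choice of base samples, and the use of the population standard deviation are all assumptions made for illustration; the abstract does not specify these details. In practice a library implementation (for example the SMOTE class in imbalanced-learn) would normally be used, and oversampling is usually applied to the training portion of each fold only, so that synthetic points do not leak into the evaluation data.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Simplified SMOTE: build n_new synthetic minority samples.

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def mean_std(values):
    """Mean and population standard deviation, as used for 'mean ± std'."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)


if __name__ == "__main__":
    # Toy minority class with two features (hypothetical values).
    minority = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.4], [0.9, 2.1]]
    for point in smote(minority, n_new=3, k=2):
        print([round(c, 3) for c in point])

    # Hypothetical per-fold confusion counts (tp, fp, fn, tn).
    folds = [(40, 2, 3, 69), (41, 3, 2, 68), (39, 1, 4, 70)]
    mean, std = mean_std([mcc(*f) for f in folds])
    print(f"MCC: {mean:.4f} ± {std:.4f}")
```

Because every synthetic point is a convex combination of two existing minority samples, it always stays within the range spanned by the original minority data. MCC uses all four confusion-matrix cells, which is why it is often reported alongside accuracy on imbalanced problems.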