Garuda - Garba Rujukan Digital

P-Index (2020 - 2025): 0.23

This author has published in the following journal:

Science in Information Technology Letters

Hardiyanti P, Cicin

Unknown Affiliation

Author-ID : 8847355

Computer Science & IT

Published: 1 document

Articles

Optimizing breast cancer classification using SMOTE, Boruta, and XGBoost
Hardiyanti P, Cicin
Science in Information Technology Letters Vol 6, No 1 (2025): May 2025
Publisher : Association for Scientific Computing Electronics and Engineering (ASCEE)

DOI: 10.31763/sitech.v6i1.2109

Breast cancer remains one of the leading causes of death among women worldwide. This study develops a breast cancer classification framework based on clinical data by integrating the Synthetic Minority Oversampling Technique (SMOTE), the Boruta feature selection algorithm, and the XGBoost classifier. The proposed approach is tested on the Wisconsin Breast Cancer Diagnostic (WBCD) dataset, which consists of 569 samples and 30 numerical features. SMOTE addresses class imbalance, Boruta selects the most relevant diagnostic features, and XGBoost serves as the main classification algorithm because of its robustness on tabular and imbalanced data. Model validation uses Repeated Stratified K-Fold Cross-Validation with 30 repetitions to ensure statistical stability. The resulting model achieves an average accuracy of 0.9608 ± 0.0274, precision of 0.9465 ± 0.0481, recall of 0.9512 ± 0.0524, and F1-score of 0.9475 ± 0.0374. The ROC-AUC reaches 0.9926 ± 0.0094, the PR-AUC is 0.9906 ± 0.0113, and the Matthews Correlation Coefficient (MCC) is 0.9179 ± 0.0575, indicating a well-balanced model. Clinically, the model can aid early diagnosis by discarding irrelevant diagnostic attributes and retaining only 10 key features without compromising accuracy, thereby offering a lightweight yet reliable diagnostic tool. Limitations include the relatively small dataset and the absence of hyperparameter tuning. Future research should explore larger datasets, advanced ensemble methods, and interpretability techniques such as SHAP or LIME to improve clinical transparency and adoption.
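
The abstract reports each metric as a mean ± spread over cross-validation folds. As a rough illustration of how such figures can be derived, the sketch below computes accuracy, precision, recall, F1-score, and MCC from confusion-matrix counts and then summarizes them across folds. It is not the authors' code: the function names `binary_metrics` and `summarize` are hypothetical, the per-fold counts are made-up illustrative values (not results from the paper), and the use of the sample standard deviation for the "±" term is an assumption, since the abstract does not say which spread measure was used.

```python
import math
import statistics


def binary_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, F1 and MCC from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against empty denominators (e.g. no positive predictions).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    # MCC uses all four cells, so it stays informative under class imbalance.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


def summarize(fold_metrics):
    """Mean and sample standard deviation of each metric across folds."""
    summary = {}
    for name in fold_metrics[0]:
        values = [m[name] for m in fold_metrics]
        summary[name] = (statistics.mean(values), statistics.stdev(values))
    return summary


# Hypothetical per-fold counts (tp, fp, fn, tn); illustrative only,
# not taken from the paper.
folds = [(20, 1, 1, 35), (19, 2, 2, 34), (21, 0, 1, 35)]
fold_metrics = [binary_metrics(*counts) for counts in folds]

for name, (mean, std) in summarize(fold_metrics).items():
    print(f"{name}: {mean:.4f} ± {std:.4f}")
```

In a repeated stratified k-fold setup like the one the abstract describes, `fold_metrics` would hold one entry per fold per repetition, so the reported spread reflects variation across all resampled splits.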
