COMPARISON OF OVERSAMPLING, UNDERSAMPLING, AND SMOTE TECHNIQUES FOR MULTICLASS BALANCE DATA HANDLING IN RANDOM FOREST AND MULTINOMIAL LOGISTIC REGRESSION
Fadjryani; Asfar; Nazwa; Tokandari, Allin Floria; Lestari, Tri Andayani; Ghani, Muhammad Azi Zarir
Jurnal Statistika dan Aplikasinya Vol. 9 No. 2 (2025): Jurnal Statistika dan Aplikasinya
Publisher : LPPM Universitas Negeri Jakarta

DOI: 10.21009/JSA.09207

Abstract

Class imbalance in multiclass classification is an important challenge in applied machine learning, particularly in medical applications such as predicting how patients leave the hospital. Although various studies have demonstrated the effectiveness of resampling techniques, the best combination of classification algorithm and balancing method for highly imbalanced multiclass hospital data has rarely been studied. This study compares the performance of the Random Forest (RF) and Multinomial Logistic Regression (MLR) algorithms under class imbalance using three resampling techniques: Random Oversampling (ROS), Random Undersampling (RUS), and the Synthetic Minority Oversampling Technique (SMOTE). The dataset comprises 1,032 inpatients with Non-Insulin-Dependent Diabetes Mellitus (NIDDM) at Undata Hospital, Central Sulawesi, for the period January 2021 to December 2023. Data pre-processing includes coding, normalization, and a stratified 80:20 data split. Feature selection was conducted using Recursive Feature Elimination (RFE), and the models were evaluated with 5-fold cross-validation using accuracy, recall, F1-score, and the Matthews correlation coefficient (MCC). The results show that the combination of RF and ROS performed best, with an accuracy of 93.65%, a macro F1 of 0.935, and a balanced accuracy of 0.95. This combination recognized minority classes well without sacrificing overall accuracy. In contrast, the MLR model showed the lowest performance, especially with RUS, which causes the loss of important data. Although SMOTE gave competitive results, it remained below ROS in this context. This study was limited to structured clinical data and compared only two types of classification models. Future work could explore deep learning-based approaches or more advanced ensembles.
The novelty of this study lies in the thorough evaluation of the combination of balancing techniques and classical classification algorithms for medical predictions with extremely unbalanced multiclass data.
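
As a rough illustration of the two simplest balancing methods named in the abstract, the sketch below implements random oversampling (ROS) and random undersampling (RUS) using only the Python standard library. It is not the authors' code: the function names, the toy data, and the class labels "A", "B", and "C" are all hypothetical, and SMOTE, which synthesizes new minority samples by interpolating between neighbors, is not shown. In practice, resampling is normally applied to the training split only, so that the held-out test data keeps its original class distribution.

```python
import random
from collections import Counter, defaultdict


def _group_by_class(X, y):
    """Collect feature rows under their class label."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    return groups


def random_oversample(X, y, seed=0):
    """ROS: duplicate randomly chosen rows until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    groups = _group_by_class(X, y)
    target = max(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for row in rows + extra:
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """RUS: keep a random subset of each class, sized to the
    smallest class. The remaining rows are discarded."""
    rng = random.Random(seed)
    groups = _group_by_class(X, y)
    target = min(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        for row in rng.sample(rows, target):
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out


# Made-up, heavily imbalanced three-class toy data (hypothetical labels).
X = [[i] for i in range(12)]
y = ["A"] * 8 + ["B"] * 3 + ["C"] * 1

X_ros, y_ros = random_oversample(X, y)
X_rus, y_rus = random_undersample(X, y)
ros_counts = Counter(y_ros)
rus_counts = Counter(y_rus)
```

On this toy data, ROS keeps all 12 original rows and adds duplicates, whereas RUS retains only one row per class. This makes concrete the information loss that the abstract attributes to RUS.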