Proceeding International Collaborative Conference on Multidisciplinary Science
Vol. 2 No. 2 (2025): December : ICCMS (Proceeding International Collaborative Conference on Multidis

XGBoost for Educational Performance: Comparing SMOTE and SMOTE-TOMEK on Imbalanced Data

Article Info

Publish Date
19 Aug 2025

Abstract

Class imbalance poses a critical challenge in educational performance prediction, particularly when at-risk students must be identified accurately from small datasets. This study evaluates three data-balancing strategies integrated with the XGBoost classifier: an unbalanced baseline, SMOTE (Synthetic Minority Over-sampling Technique), and SMOTE-TOMEK. The evaluation uses academic records from 161 Indonesian junior high school students, and its objective is to assess how effectively each strategy improves minority-class recognition and overall model reliability. The results show that SMOTE-TOMEK significantly outperforms the other methods, achieving 75% recall for the minority class, a gain of 25 percentage points (a 50% relative improvement) over both SMOTE and the baseline, each of which recalled 50%. It also recorded the highest scores across key metrics: AUC-PR (0.9874), Matthews Correlation Coefficient (0.6786), and G-mean (0.8345). Notably, SMOTE-TOMEK identified one additional at-risk student in every four without compromising majority-class precision (93%), highlighting its practical utility in real-world educational interventions. In contrast, while SMOTE improved probabilistic metrics such as AUC-ROC (0.9286), it failed to reduce false negatives and retained the baseline's 50% miss rate for at-risk students. The best-performing SMOTE-TOMEK configuration used shallower decision trees and stronger regularization, which is consistent with the resampling reducing noise and improving generalization. Statistical significance of the results was confirmed using Wilcoxon signed-rank tests at a 0.01 significance level. These findings underscore the importance of hybrid resampling techniques in educational AI pipelines. SMOTE-TOMEK not only enhances predictive accuracy but also translates model performance into actionable insights for supporting marginalized learners. The study advocates prioritizing it in future educational data science applications, especially where early identification of vulnerable students is essential for targeted academic support and policy formulation.
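
The metrics quoted in the abstract can be related to each other through a single confusion matrix. The sketch below is illustrative only: the abstract does not report test-set counts, so the values `tp=3, fn=1, fp=1, tn=13` are hypothetical, chosen because they happen to reproduce the reported recall, majority-class precision, G-mean, and MCC. The function name `minority_metrics` is also an assumption, and the at-risk (minority) class is treated as the positive class.

```python
from math import sqrt


def minority_metrics(tp, fn, fp, tn):
    """Compute imbalance-aware metrics with the minority class as positive.

    Assumes every denominator is non-zero (both classes present and both
    classes predicted at least once).
    """
    recall = tp / (tp + fn)              # share of at-risk students found
    specificity = tn / (tn + fp)         # share of not-at-risk students found
    majority_precision = tn / (tn + fn)  # precision of the majority class
    g_mean = sqrt(recall * specificity)  # geometric mean of the two recalls
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "recall": recall,
        "specificity": specificity,
        "majority_precision": majority_precision,
        "g_mean": g_mean,
        "mcc": mcc,
    }


# Hypothetical counts (NOT taken from the paper): 4 at-risk students, 3 found;
# 14 not-at-risk students, 13 correctly classified.
smote_tomek = minority_metrics(tp=3, fn=1, fp=1, tn=13)

for name, value in smote_tomek.items():
    print(f"{name}: {value:.4f}")
```

Under these assumed counts, moving from two to three detected students out of four is the 25 percentage-point recall gain described in the abstract. With so few minority cases, a single misclassification shifts every metric noticeably.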

Copyrights © 2025

Journal Info

Abbrev

ICCMS

Publisher

Subject

Other

Description

ICCMS (International Collaborative Conference on Multidisciplinary Science) is an open-access journal published by IFREL (International Forum of Researchers and Lecturers). ICCMS accepts manuscripts based on empirical research results, new scientific literature reviews, and comments/criticism ...