Lontar Komputer: Jurnal Ilmiah Teknologi Informasi
Vol. 16 No. 02 (2025): August 2025

Evaluation of the performance of the Smote, Smote Enn, and Borderline Smote resampling methods based on the number of outlier data with Z Score

arisgunadi gunadi (undiksha)
Dewi Oktofa Rahmawati (Undiksha)
Nurfa Risha (Undiksha)

Article Info

Publish Date
30 Aug 2025

Abstract

Handling class imbalance in datasets is a significant challenge in classification. The problem is most serious when the minority class plays a crucial role in decision-making. Oversampling is a widely used remedy. This study compares three popular oversampling methods, namely SMOTE (Synthetic Minority Oversampling Technique), SMOTE-ENN (SMOTE with Edited Nearest Neighbor), and Borderline-SMOTE, based on the number of outliers they produce. Outliers are identified with a Z-score-based statistical approach. The three methods were applied to several datasets. The evaluation counts the outliers present after resampling and assesses their impact on classification performance using accuracy, precision, recall, and F1-score. The results show no significant difference in the number of outliers among SMOTE, SMOTE-ENN, and Borderline-SMOTE. For the diabetes.csv dataset, the percentage of outliers in the original data and after resampling with SMOTE, SMOTE-ENN, and Borderline-SMOTE was 7.4%, 6.8%, 6.7%, and 63%, respectively. For the predict_honor.csv dataset, the figures are 7.1%, 7.3%, 7.6%, and 7%. For the winequality.csv dataset, they are 8%, 7.8%, 6.8%, and 5.8%. For the smoking.csv dataset, they are 7.1%, 7.3%, 7.6%, and 7.0%. At the level of individual features, however, the three algorithms behave more variably in the number of outliers they produce. The second finding concerns the performance of the decision tree classification model: feature correlation has a greater influence than perfect class balance in the dataset.
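To make the outlier measurement concrete, the following is a minimal sketch of Z-score outlier counting on a feature matrix. It is not the authors' code. The abstract does not state the cutoff or whether outliers are counted per row or per value, so the threshold of 3 standard deviations, the row-level percentage, and the function name `zscore_outlier_share` are all assumptions made for illustration.

```python
import numpy as np


def zscore_outlier_share(X, threshold=3.0):
    """Count Z-score outliers per feature and the share of rows containing any.

    threshold=3.0 is an assumed cutoff; the abstract does not specify one.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Constant features have zero spread; use 1.0 so their z-scores are 0.
    std = np.where(std == 0, 1.0, std)
    z = np.abs((X - mean) / std)
    flagged = z > threshold
    per_feature = flagged.sum(axis=0)        # outlier count for each column
    row_share = flagged.any(axis=1).mean()   # fraction of rows with any outlier
    return per_feature, row_share
```

Calling the function on the original feature matrix and again on each resampled matrix gives per-feature counts and an overall share that can be compared across resampling methods, which is the kind of comparison the abstract reports.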

Copyrights © 2025

Journal Info

Abbrev

lontar

Subject

Computer Science & IT; Control & Systems Engineering; Decision Sciences, Operations Research & Management; Electrical & Electronics Engineering; Engineering

Description

Lontar Komputer: Jurnal Ilmiah Teknologi Informasi focuses on the theory, practice, and methodology of all aspects of technology in the field of computer science and engineering. It provides an international publication platform to boost the scientific and academic publication of research in the ...