U, Hayya
Unknown Affiliation

Published : 1 Documents Claim Missing Document
Claim Missing Document
Check
Articles

Found 1 Documents
Search

Comparative Study on Machine Learning Algorithms for Code Smell Detection U, Hayya; Saputri, Theresia Ratih Dewi
Sinkron : jurnal dan penelitian teknik informatika Vol. 10 No. 1 (2026): Article Research January 2026
Publisher : Politeknik Ganesha Medan

Show Abstract | Download Original | Original Source | Check in Google Scholar | DOI: 10.33395/sinkron.v10i1.15439

Abstract

Detecting code smells is crucial for maintaining software quality, but rule-based methods are often not very adaptive. On the other side, existing machine learning studies often lack large-scale comparisons on modern datasets. The goal of this research is to comprehensively compare the performance of various machine learning algorithms for multi-label code smells classification in terms of effectiveness and efficiency. The dataset used in this research is SmellyCode++, containing more than 100,000 samples. Seven models: Logistic Regression, Linear SVM, Naive Bayes, Random Forest, Extra Trees, XGBoost, and LightGBM combined with Binary Relevance were trained on data balanced using random undersampling and multi-label synthetic minority over-sampling. The performance of each model was evaluated using the F1-Macro, Hamming Loss, and Jaccard Score metrics. A non-parametric statistical analysis was also conducted to validate the findings. The experiment found that ensemble-based models statically significantly outperformed the linear and probabilistic models. The performance among the top ensemble models was found to be statistically equivalent. With this statistical equivalence in accuracy, computational efficiency measured with training time became the critical tiebreaker. BR_RandomForest, BR_XGBoost, and BR_ExtraTrees proved highly efficient, while BR_LightGBM was significantly slower. This study concludes that BR_RandomForest offers the best overall trade-off in providing top tier accuracy combined with excellent computational efficiency, making it a robust choice for practical applications.