Detecting code smells is crucial for maintaining software quality, but rule-based methods often lack adaptability. At the same time, existing machine learning studies rarely provide large-scale comparisons on modern datasets. The goal of this research is to comprehensively compare the performance of machine learning algorithms for multi-label code smell classification in terms of both effectiveness and efficiency. The dataset used in this research is SmellyCode++, which contains more than 100,000 samples. Seven models (Logistic Regression, Linear SVM, Naive Bayes, Random Forest, Extra Trees, XGBoost, and LightGBM), each combined with Binary Relevance, were trained on data balanced using random undersampling and multi-label synthetic minority over-sampling. The performance of each model was evaluated using the F1-Macro, Hamming Loss, and Jaccard Score metrics, and a non-parametric statistical analysis was conducted to validate the findings. The experiments showed that ensemble-based models statistically significantly outperformed the linear and probabilistic models, while the performance among the top ensemble models was statistically equivalent. Given this equivalence in predictive performance, computational efficiency, measured as training time, became the decisive tiebreaker: BR_RandomForest, BR_XGBoost, and BR_ExtraTrees proved highly efficient, while BR_LightGBM was significantly slower. This study concludes that BR_RandomForest offers the best overall trade-off, providing top-tier accuracy combined with excellent computational efficiency, making it a robust choice for practical applications.
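To make the experimental setup concrete, the following is a minimal sketch of the Binary Relevance plus Random Forest configuration and the three evaluation metrics named above. It assumes scikit-learn's MultiOutputClassifier as the Binary Relevance wrapper and uses a synthetic multi-label dataset as a stand-in for SmellyCode++, which is not publicly reproduced here; these choices are illustrative assumptions rather than the authors' exact pipeline.

```python
# Sketch: Binary Relevance with Random Forest, evaluated with F1-Macro,
# Hamming Loss, and Jaccard Score (assumed scikit-learn-based setup).
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, hamming_loss, jaccard_score
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier

# Synthetic placeholder for code-smell features and multi-label targets.
X, y = make_multilabel_classification(n_samples=2000, n_features=40,
                                      n_classes=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Binary Relevance: one independent binary classifier per code-smell label.
br_rf = MultiOutputClassifier(RandomForestClassifier(n_estimators=100,
                                                     random_state=42))
br_rf.fit(X_train, y_train)
y_pred = br_rf.predict(X_test)

# The three evaluation metrics reported in the study.
print("F1-Macro:     ", f1_score(y_test, y_pred, average="macro"))
print("Hamming Loss: ", hamming_loss(y_test, y_pred))
print("Jaccard Score:", jaccard_score(y_test, y_pred, average="samples"))
```

The same wrapper can be reused for the other six base classifiers by swapping the estimator passed to MultiOutputClassifier, which mirrors the one-binary-model-per-label decomposition that Binary Relevance performs.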