Cross-Project Defect Prediction (CPDP) addresses the lack of historical training data in new projects. This study evaluates the performance of five machine learning models (Random Forest, CatBoost, Logistic Regression, KNN, SVM) in a CPDP scenario on the AEEEM dataset, comparing results before and after hyperparameter tuning. Models were tested in a one-to-many CPDP architecture and evaluated with Accuracy, AUC, Recall, Precision, and F1-Score. Random Forest performed best in 9 of the 20 prediction combinations, followed by CatBoost, which was best in 4 combinations after tuning. KNN and SVM won 3 and 2 combinations respectively, while Logistic Regression led in only 2. Hyperparameter tuning improved the performance of all models except Logistic Regression, with SVM improving most (6.39%), followed by Random Forest (5.14%), KNN (3.94%), and CatBoost (1.4%). Project combinations such as LC → EQ, ML → EQ, and PDE → EQ performed well, demonstrating the effectiveness of CPDP when source and target projects are similar. In contrast, combinations such as EQ → ML and ML → LC performed poorly due to differences in data distribution. This study concludes that CPDP is effective for software defect prediction when local data is limited, and can serve as a basis for further research such as transfer learning or project selection based on semantic similarity.
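For concreteness, the sketch below illustrates the one-to-many CPDP evaluation loop described above: each of the five AEEEM projects is used in turn as the training source and every other project as the test target, yielding the 20 combinations reported. Random Forest stands in for the five models. The CSV file layout, the `defective` label column, and the file naming are assumptions for illustration, not details taken from the study.

```python
# A minimal sketch of a one-to-many CPDP evaluation loop, assuming each
# AEEEM project is stored as "<project>.csv" with numeric metric columns
# and a binary "defective" label (this layout is an assumption).
from itertools import permutations

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

PROJECTS = ["EQ", "JDT", "LC", "ML", "PDE"]  # the five AEEEM projects

def load(project):
    """Load one project's metrics; the 'defective' column is an assumption."""
    df = pd.read_csv(f"{project}.csv")
    return df.drop(columns=["defective"]), df["defective"]

results = {}
# permutations of 5 projects taken 2 at a time -> the 20 source/target pairs
for source, target in permutations(PROJECTS, 2):
    X_train, y_train = load(source)   # train on one project...
    X_test, y_test = load(target)     # ...test on a different one
    # Random Forest shown here; the same loop applies to the other models.
    model = make_pipeline(StandardScaler(),
                          RandomForestClassifier(random_state=42))
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    results[(source, target)] = {
        "Accuracy": accuracy_score(y_test, y_pred),
        "AUC": roc_auc_score(y_test, y_prob),
        "Recall": recall_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred),
        "F1": f1_score(y_test, y_pred),
    }
```

Hyperparameter tuning as reported in the study would wrap the model construction step, e.g. with a grid search over each model's parameters before refitting on the source project.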