Software defect prediction plays a crucial role in software quality assurance by enabling early identification of defect-prone modules, thereby reducing testing effort and improving software reliability. This study presents a comprehensive comparative analysis of three widely used deep learning architectures, namely the Multi-Layer Perceptron (MLP), Convolutional Neural Network (CNN), and Long Short-Term Memory (LSTM) network, for software defect prediction under identical experimental conditions. A systematic seven-phase framework was employed, covering data collection, preprocessing, feature engineering, model implementation, training, validation, and comparative evaluation using twelve datasets from the NASA Metrics Data Program. Experimental results indicate that the LSTM architecture consistently outperforms CNN and MLP, achieving an average accuracy of 93.5%, precision of 94.2%, recall of 93.1%, F1-score of 93.6%, and ROC-AUC of 0.947 across all datasets. Statistical significance analysis using the Friedman and Wilcoxon signed-rank tests confirms that the performance improvements of LSTM are statistically significant (p < 0.001) with large effect sizes. Furthermore, cross-dataset evaluation demonstrates that LSTM exhibits superior generalization capability, with a smaller average accuracy degradation than CNN and MLP. The study also highlights important trade-offs between predictive performance and computational efficiency, providing practical guidance for architecture selection in real-world software defect prediction systems. These findings contribute empirical insights and deployment-oriented recommendations for advancing automated software quality assurance.
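The significance analysis described above can be sketched with standard SciPy routines: a Friedman test across the per-dataset scores of the three models, followed by a post-hoc Wilcoxon signed-rank test on a model pair. The accuracy values below are illustrative placeholders, not the paper's actual measurements.

```python
from scipy.stats import friedmanchisquare, wilcoxon

# Hypothetical per-dataset accuracies for the three architectures over
# twelve datasets (illustrative values only; not the study's results).
mlp  = [0.84, 0.81, 0.86, 0.83, 0.85, 0.82, 0.80, 0.87, 0.84, 0.83, 0.85, 0.82]
cnn  = [0.88, 0.86, 0.89, 0.87, 0.88, 0.85, 0.84, 0.90, 0.87, 0.86, 0.88, 0.85]
lstm = [0.93, 0.92, 0.94, 0.93, 0.94, 0.92, 0.91, 0.95, 0.93, 0.92, 0.94, 0.93]

# Friedman test: do the three models differ across the paired datasets?
stat, p = friedmanchisquare(mlp, cnn, lstm)
print(f"Friedman chi-square = {stat:.2f}, p = {p:.4g}")

# Post-hoc pairwise comparison (LSTM vs. CNN) with the Wilcoxon
# signed-rank test on the paired per-dataset differences.
w_stat, w_p = wilcoxon(lstm, cnn)
print(f"Wilcoxon W = {w_stat}, p = {w_p:.4g}")
```

In practice, pairwise Wilcoxon p-values would also be corrected for multiple comparisons (e.g. Holm or Bonferroni) before claiming significance.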
Copyright © 2025