Backdoor malware is one of the most critical threats in the Android ecosystem because it enables covert remote access, privilege escalation, and exfiltration of sensitive data without user awareness. Although the CCCS-CIC-AndMal-2020 dataset is publicly available, prior studies have not specifically formulated Backdoor detection as a binary classification problem under extreme class imbalance, nor have they systematically evaluated the impact of oversampling and cost-sensitive weighting using imbalance-aware performance metrics. This study proposes a comprehensive detection pipeline that integrates ensemble learning, class imbalance handling strategies, and explainability-based analysis to extract behavioral signatures of Backdoor malware. A two-stage feature selection process reduces the original 9,502-dimensional feature space to 500 informative features. Five classification algorithms are then evaluated under three imbalance-handling scenarios using a composite ranking criterion based on F1-score, Area Under the Receiver Operating Characteristic Curve (AUC), Geometric Mean (G-Mean), and Matthews Correlation Coefficient (MCC). The experimental results show that the Random Forest model combined with the Synthetic Minority Oversampling Technique (SMOTE) achieves the best performance, with an F1-score of 0.9043, AUC of 0.9909, G-Mean of 0.9422, and MCC of 0.8948. Furthermore, SHapley Additive exPlanations (SHAP) analysis identifies 39 Android permissions related to account access, covert communication, and privilege escalation as key behavioral signatures, with the permissions feature group contributing 2.31 times higher discriminative importance than non-permission features. These findings indicate that interpretable ensemble learning not only improves detection performance but also provides actionable insights for static malware analysis.
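For readers unfamiliar with the imbalance-aware metrics named above, the following is a minimal sketch of how F1-score, G-Mean, and MCC follow from a binary confusion matrix using their standard definitions. The function name `imbalance_metrics`, the zero-denominator convention, and the confusion-matrix counts in the usage note are illustrative assumptions, not the study's code or data. AUC is omitted because it requires ranked prediction scores rather than counts.

```python
import math


def imbalance_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute F1, G-Mean, and MCC from binary confusion-matrix counts.

    Hypothetical helper for illustration; the positive class is assumed
    to be the minority (Backdoor) class.
    """

    def safe_div(num: float, den: float) -> float:
        # Assumed convention: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    recall = safe_div(tp, tp + fn)        # true positive rate
    specificity = safe_div(tn, tn + fp)   # true negative rate

    # F1 is the harmonic mean of precision and recall.
    f1 = safe_div(2 * tp, 2 * tp + fp + fn)

    # G-Mean is the geometric mean of recall and specificity.
    g_mean = math.sqrt(recall * specificity)

    # MCC correlates predicted and actual labels over all four cells.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = safe_div(tp * tn - fp * fn, mcc_den)

    return {"f1": f1, "g_mean": g_mean, "mcc": mcc}
```

Calling `imbalance_metrics(tp=90, fp=5, tn=895, fn=10)` on these made-up counts returns all three scores in one dictionary. Unlike accuracy, each of them drops noticeably when the minority class is misclassified, which is why such metrics suit evaluation under extreme class imbalance.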
Copyrights © 2026