Lung adenocarcinoma (LUAD), a leading cause of cancer-related mortality, requires reliable diagnostic tools. This study proposes a robust multi-stage feature selection and classification framework for biomarker discovery, using The Cancer Genome Atlas lung adenocarcinoma (TCGA-LUAD) dataset as the primary dataset and GSE19188 for independent validation. The framework combines differential expression analysis (Wilcoxon rank-sum test), joint mutual information maximization (JMIM), and sparse autoencoder-based refinement to identify a compact, predictive set of five genes. These genes are involved in key lung cancer pathways, including epidermal growth factor receptor (EGFR) signaling, cell cycle regulation, and immune response, and include biomarkers such as surfactant protein A2 (SFTPA2), napsin A aspartic peptidase (NAPSA), and T-box transcription factor 4 (TBX4). The hybrid deep learning classifier achieved an accuracy of 98.4% and an area under the receiver operating characteristic curve (AUC-ROC) of 0.996 on TCGA-LUAD, and generalized well to GSE19188 (accuracy: 96.7%, AUC-ROC: 0.993). Overall, the framework offers an interpretable and effective solution for LUAD classification and biomarker identification.
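As an illustration of the first stage only, the differential expression filter can be sketched with a Wilcoxon rank-sum test applied gene by gene. This is a minimal sketch under stated assumptions, not the study's implementation: the gene names, the toy expression values, the normal approximation without tie correction, and the Bonferroni-adjusted threshold are all hypothetical choices made for the example.

```python
import math


def rank_with_ties(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_p_value(x, y):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation, no tie correction)."""
    n1, n2 = len(x), len(y)
    ranks = rank_with_ties(list(x) + list(y))
    w = sum(ranks[:n1])  # rank sum of the first group
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


def select_genes(tumor, normal, alpha=0.05):
    """Keep genes whose p-value passes a Bonferroni-adjusted threshold (an assumption)."""
    threshold = alpha / len(tumor)
    return [g for g in tumor if rank_sum_p_value(tumor[g], normal[g]) < threshold]


# Hypothetical toy data: one gene clearly shifted in tumors, one with no shift.
tumor = {
    "GENE_SHIFTED": [100.0 + i for i in range(20)],
    "GENE_NULL": [10.0 + 2 * i for i in range(20)],
}
normal = {
    "GENE_SHIFTED": [10.0 + i for i in range(20)],
    "GENE_NULL": [11.0 + 2 * i for i in range(20)],
}

selected = select_genes(tumor, normal)
print(selected)
```

In practice a library routine such as `scipy.stats.ranksums` would normally replace the hand-written test; the pure-Python version is used here only to keep the sketch dependency-free. The genes that pass this filter would then feed the JMIM and sparse autoencoder stages, which are not shown.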
Copyright © 2025