Remote sensing scene classification faces significant challenges in distinguishing visually similar land-use categories, owing to high intraclass variation and interclass similarity in high-resolution imagery. Although deep learning approaches have shown promise, single-architecture methods often fail to capture the diverse spatial and hierarchical features required for robust scene discrimination. This study proposes MSDFF-RCNet, a multi-structure data-fusion framework combined with a recurrent attention mechanism, to enhance remote sensing scene classification. The framework integrates complementary feature representations from the AlexNet, ResNet50, and DenseNet161 architectures, while the recurrent attention mechanism focuses on discriminative spatial regions to improve classification accuracy. Comprehensive experiments on four benchmark datasets demonstrate substantial gains over the baseline ARCNet architecture: UC Merced (43.8% to 84.9%, +41.1 points), AID (63.8% to 94.4%, +30.6 points), NWPU-RESISC45 (61.5% to 95.4%, +33.9 points), and OPTIMAL-31 (47.3% to 87.9%, +40.6 points). Statistical significance analysis confirmed the reliability of these improvements (p < 0.01), and evaluation across precision, recall, and F1-score metrics validated the framework's robustness. Although the multi-structure approach requires substantial computational resources (a 25.6× increase in parameters), the consistent accuracy improvements across diverse datasets demonstrate the effectiveness of complementary feature fusion for remote sensing scene classification. The proposed framework offers a valuable contribution to automated Earth observation systems that require high-precision land-use classification.
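To make the fusion-plus-attention idea concrete, the following is a minimal NumPy sketch, not the paper's implementation: it assumes the three backbones' features have already been projected to a common channel depth and fused into one spatial map, and all dimensions, parameter names, and the simple tanh recurrence are illustrative stand-ins for the trained MSDFF-RCNet components.

```python
import numpy as np

rng = np.random.default_rng(42)

# --- Hypothetical shapes (illustrative, not the paper's configuration) ---
H, W = 7, 7          # spatial grid of the fused feature map
C = 256              # assumed common channel depth after fusing the backbones
HID = 128            # recurrent hidden size
STEPS = 3            # number of attention "glimpses"
CLASSES = 21         # e.g., UC Merced has 21 scene categories

# Stand-in for features fused from AlexNet / ResNet50 / DenseNet161,
# flattened to (H*W, C); the real model would produce this via CNNs.
fused = rng.standard_normal((H * W, C))

# Randomly initialized parameters (learned by backpropagation in practice).
W_att = rng.standard_normal(C + HID) * 0.01      # scores each location
W_h = rng.standard_normal((HID, C)) * 0.01
U_h = rng.standard_normal((HID, HID)) * 0.01
W_cls = rng.standard_normal((CLASSES, HID)) * 0.01

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

h = np.zeros(HID)
for _ in range(STEPS):
    # Score every spatial location against the current hidden state,
    # so each glimpse can attend to different discriminative regions.
    scores = np.array([np.dot(W_att, np.concatenate([f, h])) for f in fused])
    alpha = softmax(scores)                  # attention over the H*W locations
    context = alpha @ fused                  # attended feature vector, shape (C,)
    h = np.tanh(W_h @ context + U_h @ h)     # recurrent state update

probs = softmax(W_cls @ h)                   # distribution over scene classes
```

The recurrence lets later glimpses condition on what earlier ones attended to, which is the intuition behind focusing on discriminative spatial regions rather than pooling the whole map uniformly.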