This study evaluates the effectiveness of multimodal learning in Artificial Intelligence of Things (AIoT) systems, focusing on the integration of sensor fusion and computer vision for classification tasks. A systematic review and meta-analysis was conducted on studies published between 2020 and 2025. Thirteen studies met the inclusion criteria; however, only six provided comparable quantitative data, owing to inconsistent baseline reporting and evaluation practices in the remainder. Where comparable evaluations are available, multimodal approaches generally improve accuracy over unimodal baselines, with an average increase of 8.88% (95% CI: 5.33%–12.44%, p < 0.001). High heterogeneity was observed, influenced by domain, sensor configuration, and model architecture. These findings suggest that multimodal effectiveness is conditional and depends on modality complementarity, fusion strategy, and system-level constraints.
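As a rough consistency check on the pooled estimate reported above, the standard error and test statistic can be back-computed from the confidence interval. This is a minimal sketch, not the authors' analysis: it assumes a symmetric, normal-approximation 95% interval, and the function name `z_from_ci` is hypothetical. With only six studies, the original analysis may have used a t-based interval instead, in which case the figures would differ slightly.

```python
import math


def z_from_ci(mean, lower, upper, z_crit=1.959964):
    """Back-compute SE, z, and a two-sided p-value from a reported CI.

    Assumption (not stated in the source): the interval is symmetric and
    was built as mean +/- z_crit * SE under a normal approximation.
    """
    half_width = (upper - lower) / 2.0
    se = half_width / z_crit
    z = mean / se
    # Two-sided p-value for a standard normal test statistic.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return se, z, p


# Values taken from the abstract: 8.88% (95% CI: 5.33%-12.44%).
se, z, p = z_from_ci(8.88, 5.33, 12.44)
print(f"SE = {se:.3f}, z = {z:.2f}, p < 0.001: {p < 0.001}")
```

Under these assumptions the implied standard error is about 1.81 and the z statistic is just under 4.9, which is consistent with the reported p < 0.001.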
Copyright © 2025