Sentiment analysis has evolved from text-based approaches to multimodal sentiment analysis (MSA), which integrates textual and visual data to enhance the accuracy of emotional understanding, especially in visually rich social media contexts. This study presents a systematic literature review (SLR) of recent developments in text-image-based MSA, aiming to identify prevailing methods, fusion strategies, and major research gaps. Following the PRISMA protocol, 20 key articles published between 2019 and 2024 were selected and analyzed. The results indicate that deep learning models such as LXMERT, ViLBERT, and ERNIE-ViL outperform traditional architectures, achieving accuracies above 80% on datasets such as MVSA and Twitter. Attention mechanisms and advanced feature fusion techniques contribute significantly to improving both accuracy and interpretability. However, challenges remain in annotation quality, semantic alignment across modalities, and real-time implementation constraints. This study contributes by mapping the state of the art in multimodal sentiment analysis, highlighting underexplored research gaps, and offering directions for future work toward more adaptive and context-aware sentiment systems.
Copyright © 2025