Monitoring mental health via social media often relies on unimodal approaches, such as sentiment analysis on text or single-stage image classification, or on early feature fusion. However, in real-world contexts where emotions are conveyed through text, emojis, and images, unimodal approaches lead to obscured decision-making pathways and diminished overall performance. To overcome these limitations, we propose EmoVibe, a hybrid multimodal AI framework for emotive analysis. EmoVibe uses an attention-based late fusion strategy, in which text embeddings are generated with Bidirectional Encoder Representations from Transformers (BERT), visual features are extracted with a Vision Transformer (ViT), and emoticon vectors linked to avatars are processed independently. These modality-specific features are then integrated at a higher level, enhancing both interpretability and performance. In contrast to early fusion methods, integrated multimodal models such as CLIP, Flamingo, and GPT-4V, and domain-adapted models such as MentaLLaMA and EmoBERTa, EmoVibe preserves modality-specific context without premature fusion. This architecture reduces processing cost and allows clearer, unambiguous rationales and explanations. EmoVibe outperforms unimodal baselines and early fusion models, achieving 89.7% accuracy across GoEmotions, FER, and AffectNet, compared with 87.4% for BERT and 84.2% for ResNet-50. Furthermore, a customizable, real-time, privacy-aware dashboard is provided to support physicians and end users. This technology enables scalable, proactive intervention options and fosters user self-awareness of mental health.
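As a concrete illustration of the attention-based late fusion described above, the following is a minimal PyTorch sketch rather than the authors' implementation: it assumes precomputed pooled BERT and ViT embeddings (768-d each), a hypothetical 300-d emoticon vector, and an arbitrary 7-class emotion head; the learned per-modality attention weights hint at how late fusion can expose which modality drove a prediction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionLateFusion(nn.Module):
    """Sketch of attention-based late fusion over three modalities.

    Assumes precomputed, modality-specific embeddings: a pooled BERT text
    vector (768-d), a pooled ViT image vector (768-d), and an emoticon
    vector (300-d, hypothetical size). Each is projected into a shared
    space, a scalar attention weight is learned per modality, and the
    weighted sum feeds a classifier head.
    """

    def __init__(self, text_dim=768, image_dim=768, emoji_dim=300,
                 fused_dim=256, num_classes=7):
        super().__init__()
        self.proj_text = nn.Linear(text_dim, fused_dim)
        self.proj_image = nn.Linear(image_dim, fused_dim)
        self.proj_emoji = nn.Linear(emoji_dim, fused_dim)
        # One scalar attention score per projected modality vector.
        self.attn = nn.Linear(fused_dim, 1)
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, text_emb, image_emb, emoji_emb):
        # Project each modality into the shared fusion space: (B, 3, fused_dim).
        h = torch.stack([
            torch.tanh(self.proj_text(text_emb)),
            torch.tanh(self.proj_image(image_emb)),
            torch.tanh(self.proj_emoji(emoji_emb)),
        ], dim=1)
        # Softmax over the modality axis yields per-sample modality weights.
        weights = F.softmax(self.attn(h), dim=1)   # (B, 3, 1)
        fused = (weights * h).sum(dim=1)           # (B, fused_dim)
        return self.classifier(fused), weights.squeeze(-1)


# Usage with random stand-in embeddings (batch of 4):
model = AttentionLateFusion()
logits, modality_weights = model(
    torch.randn(4, 768), torch.randn(4, 768), torch.randn(4, 300))
print(logits.shape, modality_weights.shape)  # torch.Size([4, 7]) torch.Size([4, 3])
```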