Multimodal facial expression recognition aims to improve emotion analysis by integrating visual, audio, and textual cues to achieve higher accuracy and robustness. However, effectively recognizing facial expressions across video, text, and audio remains challenging because of inconsistencies in how emotions are expressed across these modalities. To address this issue, this research proposes a residual mogrifier long short-term memory (RMLSTM) model that enhances robustness in multimodal facial expression recognition. By integrating residual connections into the long short-term memory (LSTM) network, the model improves its ability to capture complex dependencies among the video, text, and audio modalities. The residual connections mitigate the vanishing gradient problem and ensure stable training with better gradient flow in deeper networks, while the mogrifier mechanism dynamically refines the input features, enhancing feature interaction and alignment across modalities. The RMLSTM achieves 99.57% and 97.83% accuracy on the SAVEE and YouTube datasets, respectively, outperforming both the mel-frequency cepstral coefficients time-domain feature with iterative dilated convolutional neural network (MFCCT-1DCNN) and the attention-based multimodal popularity prediction model for short-form videos (AMPS).
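The sketch below illustrates, in PyTorch, how mogrifier interaction steps and a residual connection can be wrapped around a standard LSTM cell; it is a minimal illustration of the general idea, not the authors' implementation, and the round count, layer sizes, and the assumption that input and hidden dimensions match are hypothetical choices made for the example.

```python
import torch
import torch.nn as nn


class ResidualMogrifierLSTMCell(nn.Module):
    """Illustrative mogrifier LSTM cell with a residual connection.

    Assumptions (not from the paper): `rounds` mogrifier steps and
    input_size == hidden_size so the residual addition is well defined.
    """

    def __init__(self, input_size: int, hidden_size: int, rounds: int = 5):
        super().__init__()
        assert input_size == hidden_size, "residual add assumes matching dims"
        self.rounds = rounds
        self.cell = nn.LSTMCell(input_size, hidden_size)
        # Alternating projections for the mogrifier interaction steps.
        self.q = nn.ModuleList(
            [nn.Linear(hidden_size, input_size, bias=False)
             for _ in range(rounds // 2 + rounds % 2)]
        )
        self.r = nn.ModuleList(
            [nn.Linear(input_size, hidden_size, bias=False)
             for _ in range(rounds // 2)]
        )

    def mogrify(self, x, h):
        # Alternately gate the input with the hidden state and vice versa,
        # refining the fused multimodal features before the LSTM update.
        qi, ri = iter(self.q), iter(self.r)
        for i in range(1, self.rounds + 1):
            if i % 2:   # odd step: modulate the input features
                x = 2 * torch.sigmoid(next(qi)(h)) * x
            else:       # even step: modulate the hidden state
                h = 2 * torch.sigmoid(next(ri)(x)) * h
        return x, h

    def forward(self, x, state):
        h, c = state
        x_mog, h_mog = self.mogrify(x, h)
        h_new, c_new = self.cell(x_mog, (h_mog, c))
        # Residual connection: add the mogrified input back to the output
        # to keep gradients flowing through deeper stacks.
        return h_new + x_mog, (h_new, c_new)


if __name__ == "__main__":
    # Hypothetical fused video/text/audio feature vector of dimension 256.
    cell = ResidualMogrifierLSTMCell(input_size=256, hidden_size=256)
    x = torch.randn(8, 256)
    state = (torch.zeros(8, 256), torch.zeros(8, 256))
    out, state = cell(x, state)
    print(out.shape)  # torch.Size([8, 256])
```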