Emotion recognition is a vital component of human–computer interaction and intelligent systems, yet robust multimodal emotion recognition remains challenging due to high-dimensional input spaces, noisy features, and the complexity of integrating heterogeneous modalities. This study proposes a hybrid multimodal framework that improves both accuracy and computational efficiency by combining the semantic representation capability of Large Language Models (LLMs) with the optimization strengths of metaheuristic algorithms. In the proposed approach, an LLM extracts high-level contextual features from text and audio streams, while the Binary Artificial Hummingbird Algorithm (BAHA) performs feature selection to remove redundant attributes. Subsequently, the Goose Algorithm (GA) tunes classifier hyperparameters, and the Komodo Mlipir Algorithm (KMA) performs late fusion of the per-modality outputs. Experiments on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset, evaluated over six emotion categories, show that this hybrid approach captures subtle affective cues and surpasses state-of-the-art baselines, achieving an accuracy of 87.5%. Integrating LLMs with multiple specialized metaheuristics thus yields a substantially more robust emotion recognition pipeline and represents a promising direction toward more emotionally intelligent systems.
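To make the four-stage pipeline concrete, the sketch below illustrates its control flow only: LLM-derived features per modality, binary feature selection, hyperparameter tuning, and weighted late fusion. The paper's actual operators for BAHA, the Goose Algorithm, and KMA are not specified here, so each is stood in for by simple random search, and all data, names, and parameters are hypothetical.

```python
# Illustrative sketch of the pipeline's control flow; the metaheuristic
# stages (BAHA, GA, KMA) are replaced by random-search stand-ins, since
# their exact operators are not given in this abstract.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Stage 1 (stand-in): pretend these are LLM-derived embeddings per modality.
X_text = rng.normal(size=(200, 64))    # hypothetical text features
X_audio = rng.normal(size=(200, 32))   # hypothetical audio features
y = rng.integers(0, 6, size=200)       # six emotion classes, as in the IEMOCAP setup

def fitness(X, y, mask, C=1.0):
    """Cross-validated accuracy of a classifier on the selected features."""
    if mask.sum() == 0:
        return 0.0
    clf = LogisticRegression(C=C, max_iter=1000)
    return cross_val_score(clf, X[:, mask], y, cv=3).mean()

def select_features(X, y, iters=30):
    """Stand-in for BAHA feature selection: sample binary masks, keep the best."""
    best_mask = np.ones(X.shape[1], dtype=bool)
    best_fit = fitness(X, y, best_mask)
    for _ in range(iters):
        mask = rng.random(X.shape[1]) > 0.5
        f = fitness(X, y, mask)
        if f > best_fit:
            best_mask, best_fit = mask, f
    return best_mask

def tune_hyperparameter(X, y, mask, iters=10):
    """Stand-in for the Goose Algorithm: random search over the C hyperparameter."""
    best_C, best_fit = 1.0, fitness(X, y, mask)
    for _ in range(iters):
        C = 10 ** rng.uniform(-2, 2)
        f = fitness(X, y, mask, C)
        if f > best_fit:
            best_C, best_fit = C, f
    return best_C

# Stages 2-3, applied per modality: feature selection, then hyperparameter tuning.
probs = []
for X in (X_text, X_audio):
    mask = select_features(X, y)
    C = tune_hyperparameter(X, y, mask)
    clf = LogisticRegression(C=C, max_iter=1000).fit(X[:, mask], y)
    probs.append(clf.predict_proba(X[:, mask]))

# Stage 4, stand-in for KMA late fusion: grid-search a convex weight over the
# two modalities' class-probability outputs (evaluated on training data here,
# purely for illustration).
best_w, best_acc = 0.5, 0.0
for w in np.linspace(0, 1, 11):
    fused = w * probs[0] + (1 - w) * probs[1]
    acc = (fused.argmax(axis=1) == y).mean()
    if acc > best_acc:
        best_w, best_acc = w, acc
print(f"fusion weight={best_w:.1f}, training accuracy={best_acc:.3f}")
```

Each stand-in shares a single fitness signal (cross-validated classifier accuracy), which mirrors how a wrapper-style metaheuristic would evaluate candidate feature masks, hyperparameters, or fusion weights; swapping in the actual BAHA, GA, and KMA update rules would change only the search loops, not this overall structure.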