Effective communication across languages and cultures is essential in today’s interconnected world. Multimodal-multilingual language models (MMMLMs) aim to advance this goal by integrating text, speech, and visual understanding across diverse linguistic contexts. This study evaluates four leading MMMLMs (GIT, mPLUG, CLIP, and Whisper + GPT-4V) on cross-lingual and cross-modal tasks, including image captioning, visual question answering, speech-to-image generation, and idiomatic translation. Performance was assessed in high-resource (English, Arabic), medium-resource (Malay), and low-resource (Macedonian) settings. Results show strong performance on structured tasks but notable limitations in cultural reasoning, figurative language interpretation, and semantic grounding in low-resource settings. GIT delivered the most consistent multilingual results, while Whisper + GPT-4V excelled in fluency yet lacked cultural sensitivity. To address these gaps, the study proposes culturally informed evaluation protocols that combine quantitative metrics such as BLEU, CIDEr, and F1 with qualitative, community-centered approaches. These include cross-cultural annotation panels, inter-rater reliability validation using Cohen’s kappa, and a novel “cultural fidelity” metric that measures alignment with culturally specific norms. The findings emphasize the need for inclusive datasets, ethical development, and interdisciplinary collaboration to ensure that MMMLMs support equitable and culturally aware global communication.
Copyrights © 2026