Annisa Nur Ramadhani
Universitas Muhammadiyah Surakarta

Published: 1 document

Articles
How Well Do Vision-Language Models Explain Sarcasm? An Evaluation of Multimodal Explanation Quality for Social Media Posts
Ikhlasul Amal; Annisa Nur Ramadhani
Artificial Intelligence Systems and Its Applications, Vol. 1, No. 1, June 2025
Publisher : CV Cognispectra Publishing

DOI: 10.65917/aisa.v1i1.22

Abstract

Sarcasm is a complex communicative phenomenon frequently encountered on social media, where the literal meaning of language sharply contradicts the speaker’s true intent, often reinforced by multimodal cues such as incongruent images or memes. While prior research has primarily focused on detecting sarcasm, far less attention has been devoted to generating human-interpretable explanations that clarify why content is sarcastic. This study addresses that gap by systematically evaluating the capabilities of fifteen Vision–Language Models (VLMs) of varying parameter sizes to produce multimodal sarcasm explanations under zero-shot and few-shot learning conditions. Using the publicly available MORE dataset of social media posts annotated with concise human-written explanations, we benchmarked each model’s outputs with three widely used evaluation metrics: ROUGE, BERTScore, and Sentence-BERT, assessing both surface-level overlap and deeper semantic alignment. Our findings reveal that smaller models can rival or even outperform larger architectures on n-gram similarity measures, while embedding-based metrics often yield high scores even when generated explanations contradict the ground truth. These results highlight the limitations of current automatic metrics in reliably capturing the nuanced reasoning underlying sarcasm. Overall, this work demonstrates that model scale does not consistently predict explanation quality and underscores the need for more robust evaluation protocols.
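The abstract's core criticism — that surface-overlap metrics can score a contradictory explanation highly — is easy to see with a minimal sketch of ROUGE-1 F1 (unigram overlap). This is a simplified stand-in, not the paper's evaluation code: real evaluations typically use a dedicated package with stemming and multiple n-gram orders, and the example sentences below are invented for illustration.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram (ROUGE-1) F1 overlap between two texts.

    Simplified: lowercase whitespace tokenization, no stemming.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: a generated explanation that contradicts the
# reference still shares most of its words, so surface overlap is high.
reference = "the post is sarcastic because the image contradicts the caption"
generated = "the post is not sarcastic because the image matches the caption"
print(round(rouge1_f1(generated, reference), 2))  # high despite opposite meaning
```

The same failure mode applies, in a different form, to embedding-based metrics: negations and antonyms often leave sentence embeddings close together, which is the behavior the study reports for BERTScore and Sentence-BERT.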