Aggrwal, Mayank
Unknown Affiliation

Published: 1 document

Transformer-based Hindi image description and storytelling using enhanced attention and FastText embeddings
Sharma, Anjali; Aggrwal, Mayank; Khanna, Jitin
IAES International Journal of Artificial Intelligence (IJ-AI) Vol 15, No 2: April 2026
Publisher: Institute of Advanced Engineering and Science

DOI: 10.11591/ijai.v15.i2.pp1771-1782

Abstract

This work presents a novel image description generation framework that combines a Transformer-based encoder-decoder architecture with a custom squeeze-and-excitation (SE) attention block integrated into an EfficientNet feature extractor. The decoder uses FastText embeddings specifically trained for Hindi and is evaluated on the Microsoft common objects in context (MS-COCO) dataset. To improve the captioning process, the model incorporates a generative pre-trained transformer (GPT) module to generate narrative descriptions based on the initial captions and applies multiple similarity metrics to assess output quality. The proposed system significantly outperforms existing methods, achieving high bilingual evaluation understudy (BLEU) scores (BLEU-1 to BLEU-4: 83.24, 73.17, 64.56, and 58.22), a consensus-based image description evaluation (CIDEr) score of 81.41, an F1 score of 90.29, and a metric for evaluation of translation with explicit ordering (METEOR) score of 81.18, indicating strong caption accuracy. Furthermore, the model achieves low error rates, with a word error rate (WER) of 15% and a character error rate (CER) of 11%. This work highlights the challenges of applying large-scale datasets like MS-COCO to resource-limited languages and demonstrates the effectiveness of integrating FastText embeddings with transformer-based models for Hindi image captioning.
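
The abstract reports WER and CER but does not say how they were computed. As a rough illustration only, the sketch below uses the conventional definition: the Levenshtein edit distance between hypothesis and reference, divided by the reference length, counted over words for WER and over characters for CER. The function names are hypothetical and this is not the authors' evaluation code. It also assumes that characters are Unicode code points; for Devanagari text, a grapheme-cluster count may be more appropriate and would give different CER values.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, insertions, and
    deletions needed to turn `hyp` into `ref`.
    """
    # prev[j] holds the distance between the first i-1 items of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate over Unicode code points (an assumption)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


if __name__ == "__main__":
    # Made-up caption pair for illustration; not taken from the paper.
    reference = "एक आदमी घोड़े पर बैठा है"
    hypothesis = "एक आदमी घोड़े पर है"
    print(f"WER = {wer(reference, hypothesis):.2%}")
    print(f"CER = {cer(reference, hypothesis):.2%}")
```

Under this definition, a WER of 15% means that on average about 15 word-level edits are needed per 100 reference words to turn a generated caption into its reference.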