Garuda - Garba Rujukan Digital

P-Index (2021 - 2026): 0.23

Journals this author has published in: Journal of Intelligent Computing and Health Informatics (JICHI)

Eleni Vogiatzi

International Hellenic University

Author-ID : 9939996

Computer Science & IT; Dentistry; Electrical & Electronics Engineering; Medicine & Pharmacology; Public Health

Published: 1 document

Articles

Resource Efficient Semantic Retrieval Pipeline via Generative Captioning and Text-to-Text Transformers for Bridging the Modality Gap
Muhammad Firmansyah; Dhendra Marutho; Irwansyah Saputra; Eleni Vogiatzi
Journal of Intelligent Computing & Health Informatics Vol 6, No 2 (2025): September
Publisher : Universitas Muhammadiyah Semarang Press

DOI: 10.26714/jichi.v6i2.19240

The rapid expansion of multimodal digital content necessitates the development of robust information retrieval systems capable of bridging the semantic gap between visual and textual data. However, contemporary cross-modal models, such as CLIP, impose significant computational demands, rendering them impractical for real-time deployment in resource-limited environments. To address this efficiency challenge, this study introduces a novel lightweight retrieval pipeline that reconceptualizes cross-modal retrieval as a text-to-text task through generative transformation. The proposed methodology employs the Bootstrapped Language-Image Pretraining (BLIP) model to distill visual features into rich textual descriptions, which are subsequently encoded into dense semantic vectors using the T5 transformer architecture. Extensive experiments conducted on the MSCOCO and Flickr30K datasets demonstrate that the proposed pipeline achieves a Semantic Average Recall (SAR@5) of 0.561, significantly surpassing traditional lexical (BM25) and dense (SBERT) baselines. Notably, while the computationally intensive CLIP model retains a slight advantage in absolute accuracy, our approach delivers approximately 90% of CLIP’s semantic performance while enhancing inference throughput by 2.1× and reducing GPU memory consumption by 62%. These findings confirm that generative semantic distillation offers a scalable, cost-effective alternative to end-to-end multimodal systems, particularly for latency-sensitive applications requiring high semantic fidelity.
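
The text-to-text retrieval idea in the abstract can be pictured with a minimal sketch. It assumes the image captions have already been generated (the paper uses BLIP for that step) and substitutes a simple bag-of-words vector for the T5 dense embedding so that the example runs without any model downloads. The function names, caption strings, and image IDs are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def embed(text):
    """Placeholder encoder: bag-of-words token counts.

    Assumption: stands in for the T5 dense embedding described in the abstract.
    """
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_index(captions):
    """Embed each image's caption once, keyed by image ID."""
    return {image_id: embed(caption) for image_id, caption in captions.items()}


def retrieve(query, index, k=5):
    """Return the k image IDs whose caption vectors are closest to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda image_id: cosine(q, index[image_id]), reverse=True)
    return ranked[:k]


# Hypothetical captions standing in for generated image descriptions.
captions = {
    "img_001": "a dog running on the beach",
    "img_002": "a plate of pasta on a wooden table",
    "img_003": "two children playing football in a park",
}

index = build_index(captions)
top = retrieve("dog on a beach", index, k=2)
print(top)
```

Because images are reduced to text before indexing, a text query is compared only against text vectors at search time, which is the property the abstract credits for the lower memory use and higher throughput relative to CLIP. Swapping `embed` for a real sentence encoder would not change the rest of the sketch.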

Co-Authors: Dhendra Marutho; Irwansyah Saputra; Muhammad Firmansyah
