Resource Efficient Semantic Retrieval Pipeline via Generative Captioning and Text-to-Text Transformers for Bridging the Modality Gap
Muhammad Firmansyah; Dhendra Marutho; Irwansyah Saputra; Eleni Vogiatzi
Journal of Intelligent Computing & Health Informatics Vol 6, No 2 (2025): September
Publisher: Universitas Muhammadiyah Semarang Press

DOI: 10.26714/jichi.v6i2.19240

Abstract

The rapid expansion of multimodal digital content necessitates the development of robust information retrieval systems capable of bridging the semantic gap between visual and textual data. However, contemporary cross-modal models, such as CLIP, impose significant computational demands, rendering them impractical for real-time deployment in resource-limited environments. To address this efficiency challenge, this study introduces a novel lightweight retrieval pipeline that reconceptualizes cross-modal retrieval as a text-to-text task through generative transformation. The proposed methodology employs the Bootstrapped Language-Image Pretraining (BLIP) model to distill visual features into rich textual descriptions, which are subsequently encoded into dense semantic vectors using the T5 transformer architecture. Extensive experiments conducted on the MSCOCO and Flickr30K datasets demonstrate that the proposed pipeline achieves a Semantic Average Recall (SAR@5) of 0.561, significantly surpassing traditional lexical (BM25) and dense (SBERT) baselines. Notably, while the computationally intensive CLIP model retains a slight advantage in absolute accuracy, our approach delivers approximately 90% of CLIP’s semantic performance while enhancing inference throughput by 2.1× and reducing GPU memory consumption by 62%. These findings confirm that generative semantic distillation offers a scalable, cost-effective alternative to end-to-end multimodal systems, particularly for latency-sensitive applications requiring high semantic fidelity.
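
The caption-then-encode flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `caption_image` is a hypothetical stand-in for a BLIP captioner (here it only looks up a pre-written caption), `embed_text` is a hypothetical stand-in for a T5-based encoder (here a simple bag-of-words vector), and the sample captions are invented for illustration. Only the overall shape is taken from the abstract: images become text, text becomes vectors, and a text query is matched against those vectors.

```python
import math
from collections import Counter

# Invented demo data: in the described pipeline these captions would be
# generated from the images by a captioning model such as BLIP.
DEMO_CAPTIONS = {
    "img1": "a dog runs on the beach",
    "img2": "a plate of pasta on a table",
    "img3": "two dogs play in the snow",
}

def caption_image(image_id, captions=DEMO_CAPTIONS):
    """Hypothetical stand-in for a captioner: returns a stored caption."""
    return captions[image_id]

def embed_text(text):
    """Hypothetical stand-in for a text encoder: bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(image_ids):
    """Offline step: caption every image once and store its text vector."""
    return {i: embed_text(caption_image(i)) for i in image_ids}

def search(query, index, k=5):
    """Online step: embed the text query and rank images by similarity."""
    q = embed_text(query)
    ranked = sorted(index, key=lambda i: cosine(q, index[i]), reverse=True)
    return ranked[:k]

index = build_index(DEMO_CAPTIONS)
print(search("dog on the beach", index, k=2))
```

In this arrangement the image model runs only while the index is built, so query-time work involves text alone. That is one plausible reading of where the efficiency gains reported in the abstract come from, though the abstract itself does not break the savings down.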