The increasing demand for efficient multimodal information retrieval has driven significant research into bridging visual and textual data. While sophisticated models such as CLIP offer state-of-the-art semantic alignment, their substantial computational requirements hinder deployment in resource-constrained environments. This study introduces a lightweight retrieval framework that leverages the BLIP image captioning model to transform images into rich textual descriptions, effectively reframing cross-modal retrieval as a text-to-text task. We systematically evaluated three retrieval models (BM25, SBERT, and T5) on caption-transformed MSCOCO and Flickr30K datasets, using both classical metrics (Recall@5, mAP) and semantic-aware metrics (SAR@5, Semantic mAP). Experimental results demonstrate that T5 achieves the strongest semantic performance (SAR@5 = 0.561, Semantic mAP = 0.524), surpassing SBERT (SAR@5 = 0.524) and the lexical BM25 baseline (SAR@5 = 0.312). Notably, the proposed BLIP+T5 pipeline attains 88% of CLIP’s semantic accuracy while reducing inference latency by approximately 60% and GPU memory consumption by more than 60%. These findings underscore the potential of caption-based retrieval frameworks as scalable, cost-effective alternatives to computationally intensive multimodal systems, especially in latency-sensitive and resource-limited scenarios. Future work will explore fine-tuning strategies, domain-adapted semantic metrics, and robustness under real-world conditions to further improve retrieval effectiveness.
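
For concreteness, the sketch below illustrates the caption-based pipeline described above: BLIP converts each corpus image into a caption offline, after which retrieval is performed purely over text. It is a minimal sketch, not the paper's implementation; the specific Hugging Face checkpoints (`Salesforce/blip-image-captioning-base`, `sentence-transformers/sentence-t5-base`), the cosine-similarity ranking, and the helper names are assumptions made for illustration.

```python
# Minimal sketch of a caption-based cross-modal retrieval pipeline (illustrative only).
# Assumed components: a BLIP captioning checkpoint and a sentence-level T5 encoder;
# the system evaluated in the paper may differ in checkpoints, decoding, and ranking.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from sentence_transformers import SentenceTransformer, util

# 1) Offline: turn each corpus image into a textual description with BLIP.
blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption_image(path: str) -> str:
    """Generate a single caption for one image (greedy decoding for simplicity)."""
    image = Image.open(path).convert("RGB")
    inputs = blip_processor(images=image, return_tensors="pt")
    output_ids = blip_model.generate(**inputs, max_new_tokens=30)
    return blip_processor.decode(output_ids[0], skip_special_tokens=True)

image_paths = ["cat.jpg", "beach.jpg", "kitchen.jpg"]  # placeholder corpus
captions = [caption_image(p) for p in image_paths]

# 2) Online: retrieval is now text-to-text; embed captions and queries with a
#    sentence-level T5 encoder and rank corpus images by cosine similarity.
encoder = SentenceTransformer("sentence-transformers/sentence-t5-base")
caption_embeddings = encoder.encode(captions, convert_to_tensor=True)

query = "a cat sleeping on a sofa"
query_embedding = encoder.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_embedding, caption_embeddings)[0]

# Higher similarity between the query and an image's caption ranks that image higher.
ranked = sorted(zip(image_paths, scores.tolist()), key=lambda x: x[1], reverse=True)
for path, score in ranked:
    print(f"{score:.3f}  {path}")
```

Because captioning runs once per image offline, the online cost reduces to text embedding and similarity search, which is the source of the latency and memory savings reported above.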