This study presents a Transformer-based encoder-decoder model for medical image captioning that incorporates semantic medical knowledge through Concept Unique Identifiers (CUIs) from the Unified Medical Language System (UMLS). The architecture uses a Swin Transformer as the visual encoder and GPT-2 as the language decoder, with CUI integration applied during both caption preprocessing and decoding. Experiments on the ROCOv2 dataset compare two scenarios: a baseline using raw captions and an enhanced setting using CUI-enriched captions. Evaluated with BLEU, ROUGE, CIDEr, and BERT-based metrics, the CUI-integrated model outperforms several baselines, including CNN-LSTM, ViT-BioMedLM, and DeepSeek-VL, achieving a BLEU-1 of 0.371, ROUGE-L of 0.305, CIDEr of 0.275, and PubMedBERTScore-F1 of 0.893. Relative to the best-performing model before caption preprocessing (ViT-GPT2, with BLEU-1 = 0.309 and ROUGE-L = 0.218), these scores correspond to a 20.1% relative improvement in BLEU-1 and a 39.9% relative improvement in ROUGE-L. Qualitative assessment by expert radiologists further indicates improved diagnostic accuracy, descriptive completeness, and clinical relevance. The study thus contributes a novel integration of medical semantic knowledge into captioning models, offering a scalable option for clinical decision support in resource-limited settings such as Indonesia.
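The reported percentage gains can be checked from the scores quoted in the abstract. The following is a minimal sketch, assuming the percentages are relative changes with respect to the ViT-GPT2 scores; the function name `relative_improvement` and the `scores` dictionary are illustrative and not taken from the paper.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Return the relative change from baseline to new, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline * 100.0


# Scores quoted in the abstract, as (CUI-integrated model, ViT-GPT2) pairs.
# Assumption: the reported gains are relative, not absolute, differences.
scores = {
    "BLEU-1": (0.371, 0.309),
    "ROUGE-L": (0.305, 0.218),
}

# Round to one decimal place, the precision used in the abstract.
gains = {
    name: round(relative_improvement(new, base), 1)
    for name, (new, base) in scores.items()
}

for name, gain in gains.items():
    print(f"{name}: {gain}% relative improvement")
```

Under this assumption the computed values are 20.1% for BLEU-1 and 39.9% for ROUGE-L, which agree with the figures stated above.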
Copyright © 2026