Garuda - Garba Rujukan Digital

Indonesian Journal of Data and Science

Vol. 6 No. 3 (2025): Indonesian Journal of Data and Science

Sharief, Tirta Chiantalia
Hazriani
Syamsul
Anas
Yuyun

Publish Date
31 Dec 2025

This study develops a Visual Question Answering (VQA) system that extracts information from images of Makassar culinary specialties by combining the Vision Transformer (ViT) and Cahya_GPT-2 models. The main objective is to integrate visual and natural language understanding so that a computer can recognize visual objects (food images) and generate relevant text descriptions. The method is experimental: a pre-trained ViT model is fine-tuned as the visual encoder, with Cahya_GPT-2 as the text decoder. The dataset consists of images of Makassar culinary specialties such as Coto, Konro, Pisang Epe, Barongko, and Jalangkote, annotated with question-and-answer (QnA) pairs. Evaluation uses the ROUGE metric to assess how closely the model's answers match the reference answers. The results show that the multimodal model captures the image context well, with an average ROUGE-L score of 0.63, indicating a good level of closeness between the model's answers and the annotations. In conclusion, the combination of ViT and Cahya_GPT-2 can be an effective approach for natural language-based visual information extraction, especially in the Indonesian local culinary domain.
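
For readers unfamiliar with the ROUGE-L score reported in the abstract, the following is a minimal sketch of how such a score can be computed: the longest common subsequence (LCS) of tokens shared by a generated answer and a reference answer, turned into precision, recall, and F1. The function names, the lowercased whitespace tokenization, and the use of a balanced F1 are assumptions made for illustration; the abstract does not state which ROUGE implementation or tokenizer the authors used.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)  # LCS lengths for the previous row of the DP table
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """Return (precision, recall, F1) of ROUGE-L for two strings.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0, 0.0, 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(cand)  # share of the model answer found in the reference
    recall = lcs / len(ref)      # share of the reference recovered by the model answer
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Calling `rouge_l(model_answer, annotated_answer)` for each question and averaging the F1 values over a test set would give a corpus-level figure comparable in kind to the reported 0.63. Because the score rewards shared word order rather than meaning, paraphrased but correct answers can score lower than their quality suggests.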


Journal Info

Indonesian Journal of Data and Science

Abbrev

ijodas

Publisher

yocto brain

Subject

Computer Science & IT; Decision Sciences, Operations Research & Management; Mathematics

Description

IJODAS provides online media to publish scientific articles from research in the field of Data Science, Data Mining, Data Communication, Data Security and Data ...

Information Extraction from Makassar Culinary Images Using Vision Transformers and Cahya GPT-2 (Visual Question Answering Case Study)
