Visual accessibility in public spaces remains limited for individuals with visual impairments in Indonesia, despite technological advances such as image captioning. This study develops a custom dataset and a baseline CNN-LSTM image captioning model capable of describing sidewalk accessibility conditions in Indonesian. The methodology comprises collecting 748 annotated images from various Indonesian cities, with captions manually written to reflect accessibility features. The model employs DenseNet201 as the CNN encoder and an LSTM as the decoder, with 70% of the data used for training and 30% for validation. Evaluation was conducted using the BLEU and CIDEr metrics. The model achieves a BLEU-4 score of 0.27 and a CIDEr score of 0.56, indicating moderate alignment between generated and reference captions. Although the absence of an attention mechanism and the limited dataset size constrain overall performance, the model can identify key accessibility elements such as tactile paving, signage, and pedestrian barriers. This study contributes to assistive technology development in a low-resource language context and provides a foundation for future research. Expanding the dataset, incorporating attention mechanisms, and adopting transformer-based models are recommended to improve descriptive richness and accuracy.
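As a concrete illustration, the merge-style CNN-LSTM encoder-decoder described above can be sketched in Keras as follows. This is a minimal sketch, not the authors' exact implementation: the vocabulary size, caption length, embedding and LSTM dimensions, and the frozen-encoder choice are all illustrative assumptions; only the DenseNet201 encoder, the LSTM decoder, and the absence of attention come from the abstract.

    # Minimal sketch of a DenseNet201 + LSTM captioning model (no attention).
    # VOCAB_SIZE, MAX_LEN, and layer widths are assumptions, not the paper's values.
    import tensorflow as tf
    from tensorflow.keras import layers, Model
    from tensorflow.keras.applications import DenseNet201

    VOCAB_SIZE = 5000   # assumed size of the Indonesian caption vocabulary
    MAX_LEN = 30        # assumed maximum caption length in tokens
    EMBED_DIM = 256     # assumed embedding / projection dimension
    LSTM_UNITS = 256    # assumed LSTM hidden size

    # Encoder: DenseNet201 pretrained on ImageNet, global-average-pooled
    # to a single feature vector per image, used as a frozen extractor.
    cnn = DenseNet201(include_top=False, weights="imagenet", pooling="avg")
    cnn.trainable = False

    image_input = layers.Input(shape=(224, 224, 3), name="image")
    img_features = cnn(image_input)                        # (batch, 1920)
    img_embed = layers.Dense(EMBED_DIM, activation="relu")(img_features)

    # Decoder: embed the partial caption, encode it with an LSTM,
    # then merge with the image features to predict the next word.
    caption_input = layers.Input(shape=(MAX_LEN,), name="caption")
    seq = layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(caption_input)
    seq = layers.LSTM(LSTM_UNITS)(seq)

    merged = layers.add([img_embed, seq])                  # merge, no attention
    merged = layers.Dense(EMBED_DIM, activation="relu")(merged)
    next_word = layers.Dense(VOCAB_SIZE, activation="softmax")(merged)

    model = Model(inputs=[image_input, caption_input], outputs=next_word)
    model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
    model.summary()

At inference time, captions are generated one token at a time by feeding the growing sequence back into the decoder (greedy or beam search); because the merged representation sees only a single pooled image vector, the model cannot attend to specific image regions, which is consistent with the attention-related limitation noted in the abstract.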