Visual impairment is a global issue with significant impacts on the mobility and safety of individuals, especially in urban environments. Artificial intelligence solutions such as image captioning promise to assist people with visual impairments in their daily activities, but image captioning in this context still suffers from performance limitations. To address this, this study proposes a hybrid method that combines image feature extraction with VGG16, ResNet50, and YOLO on the encoder side and LSTM and BiGRU on the decoder side to generate descriptions, an approach previously shown to improve model performance on the Flickr8k dataset. By adapting this method to the Visual Assistance dataset, incorporating image augmentation through a combination of rotation and zoom, and applying transfer learning to compensate for the limited dataset size, this study improved the model's performance in supporting visually impaired pedestrians in urban environments. Evaluation results showed significant gains across several metrics. In particular, whereas Sharma et al. (2022) reported that an InceptionV3-BiLSTM model with Adaptive Attention achieved a BLEU-4 score of only 0.266 on the Visual Assistance dataset, this study achieved a 60.53% increase in BLEU-4 score over that result. Overall, this study contributes to the development of more effective and accurate solutions for visually impaired users navigating urban environments.
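To make the described architecture concrete, the following is a minimal sketch (not the authors' code) of the hybrid encoder-decoder idea: frozen pretrained VGG16 and ResNet50 backbones provide fused image features (YOLO-based object features would be concatenated in the same way), rotation and zoom augmentation plus transfer learning address the small dataset, and an LSTM followed by a BiGRU decodes captions. Hyperparameters such as VOCAB_SIZE, MAX_LEN, and EMBED_DIM are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG16, ResNet50

VOCAB_SIZE, MAX_LEN, EMBED_DIM = 8000, 40, 256  # assumed values, not from the paper

# Encoder: rotation/zoom augmentation, then fused features from two frozen
# pretrained CNNs (transfer learning). Input preprocessing is omitted for brevity.
image_in = layers.Input(shape=(224, 224, 3))
augment = tf.keras.Sequential([layers.RandomRotation(0.1), layers.RandomZoom(0.2)])
x = augment(image_in)
vgg = VGG16(include_top=False, weights="imagenet", pooling="avg")
resnet = ResNet50(include_top=False, weights="imagenet", pooling="avg")
vgg.trainable = resnet.trainable = False
fused = layers.Concatenate()([vgg(x), resnet(x)])
img_feat = layers.Dense(EMBED_DIM, activation="relu")(fused)

# Decoder: word embedding -> LSTM -> bidirectional GRU, merged with image features
# to predict the next caption word.
caption_in = layers.Input(shape=(MAX_LEN,))
emb = layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(caption_in)
seq = layers.LSTM(256, return_sequences=True)(emb)
seq = layers.Bidirectional(layers.GRU(128))(seq)  # 2 x 128 = 256, matches img_feat
merged = layers.Add()([img_feat, seq])
hidden = layers.Dense(256, activation="relu")(merged)
out = layers.Dense(VOCAB_SIZE, activation="softmax")(hidden)

model = Model([image_in, caption_in], out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```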