Effective communication is fundamental to social interaction, yet individuals with hearing impairments often face significant barriers. Indonesian Sign Language (BISINDO) is a vital communication tool for the deaf community in Indonesia, but limited public understanding of BISINDO creates communication barriers, motivating the need for an accurate automatic recognition system. This research investigates the efficacy of the Vision Transformer (ViT), a state-of-the-art deep learning architecture, for classifying static BISINDO alphabet images, exploring its potential to overcome the limitations of previous approaches through robust feature extraction. The methodology used a dataset of 26 BISINDO alphabet classes that underwent comprehensive preprocessing, including class balancing via augmentation and image normalization. The google/vit-base-patch16-224-in21k ViT model was adapted with a custom classification head and trained with a two-phase strategy: initial feature extraction with a frozen backbone, followed by full-network fine-tuning. The fine-tuned model achieved excellent performance on the unseen test set: accuracy of 99.77% (95% CI: 99.55%–99.99%), precision of 99.77%, recall of 99.72%, and a weighted F1-score of 99.77%, surpassing many previously reported methods. These findings confirm that the ViT model is a highly effective and robust solution for BISINDO alphabet image classification and underscore the potential of Transformer-based architectures for developing accurate assistive communication technologies for the Indonesian deaf and hard-of-hearing community.
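As a rough illustration of the two-phase training strategy summarized above, the sketch below shows how the pretrained ViT checkpoint could be loaded with a new 26-class head, with the backbone frozen for an initial feature-extraction phase and then unfrozen for full fine-tuning. This is not the authors' exact code; the optimizers, learning rates, and training loops are assumptions for illustration only.

```python
# Minimal sketch of the two-phase ViT setup described in the abstract.
# Assumes the Hugging Face `transformers` library and 26 BISINDO alphabet classes;
# hyperparameters and the training loops themselves are illustrative assumptions.
import torch
from transformers import ViTForImageClassification, ViTImageProcessor

MODEL_ID = "google/vit-base-patch16-224-in21k"
NUM_CLASSES = 26  # BISINDO alphabet A-Z

processor = ViTImageProcessor.from_pretrained(MODEL_ID)  # handles resizing and normalization
model = ViTForImageClassification.from_pretrained(
    MODEL_ID,
    num_labels=NUM_CLASSES,  # attaches a new classification head with 26 outputs
)

# Phase 1: feature extraction -- freeze the ViT backbone, train only the new head.
for param in model.vit.parameters():
    param.requires_grad = False
head_optimizer = torch.optim.AdamW(model.classifier.parameters(), lr=1e-3)
# ... train the classification head for a few epochs with `head_optimizer` ...

# Phase 2: fine-tuning -- unfreeze the backbone and train the whole network
# at a lower learning rate.
for param in model.vit.parameters():
    param.requires_grad = True
full_optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
# ... continue training the full model with `full_optimizer` ...
```

Freezing the backbone first lets the randomly initialized head stabilize before the pretrained weights are updated, which is a common way to reduce the risk of catastrophic forgetting during fine-tuning.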