This paper presents a novel model that integrates spatial features from residual blocks and temporal features from the FFT with a recurrent architecture comprising BiLSTM layers, gated recurrent unit (GRU) layers, and multi-head attention. The model achieves nearly 99% accuracy on both the WLASL and INCLUDE datasets, outperforming standard pretrained CNN models in feature extraction. Notably, the BiLSTM and GRU combination proves superior to alternatives such as LSTM with GRU. BLEU score analysis further validates the model's efficacy, with scores of 0.51 and 0.54 on the WLASL and INCLUDE datasets, respectively. These results affirm the model's ability to capture the intricate spatial and temporal nuances inherent in sign language gestures, enhancing accessibility and communication for the deaf and hard-of-hearing communities. The comparison underscores the advantage of the proposed model over standard approaches and the significance of the integrated architecture. Continued refinement and optimization hold promise for further improving the model's performance and applicability in real-world scenarios, contributing to more inclusive communication environments.
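To make the architectural composition concrete, the following is a minimal sketch of how a stack of BiLSTM, GRU, and multi-head attention layers over fused spatial and FFT-derived temporal features might be assembled. The layer sizes, the additive fusion of FFT magnitudes, and the names (`SignRecognizer`, `feat_dim`, etc.) are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: layer sizes, names, and the feature-fusion step
# are assumptions, not the published implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SignRecognizer(nn.Module):
    """BiLSTM -> GRU -> multi-head attention over fused spatial/temporal features."""

    def __init__(self, feat_dim=512, hidden=256, heads=4, num_classes=100):
        super().__init__()
        # Bidirectional LSTM over the per-frame feature sequence.
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        # GRU stacked on the BiLSTM output (input is 2 * hidden because it is bidirectional).
        self.gru = nn.GRU(2 * hidden, hidden, batch_first=True)
        # Multi-head self-attention over the GRU outputs.
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, frame_feats):
        # frame_feats: (batch, time, feat_dim) spatial features, e.g. from residual blocks.
        # FFT magnitude along the time axis as a temporal descriptor (assumed fusion: addition).
        spectral = torch.fft.rfft(frame_feats, dim=1).abs()
        spectral = F.pad(spectral, (0, 0, 0, frame_feats.size(1) - spectral.size(1)))
        fused = frame_feats + spectral
        x, _ = self.bilstm(fused)
        x, _ = self.gru(x)
        x, _ = self.attn(x, x, x)               # self-attention: query = key = value
        return self.classifier(x.mean(dim=1))   # average over time, then classify


# Example: a batch of 2 clips, 16 frames each, 512-dim per-frame features.
logits = SignRecognizer()(torch.randn(2, 16, 512))
print(logits.shape)  # torch.Size([2, 100])
```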