Indonesian Sign Language (BISINDO) is the primary communication medium of the deaf community in Indonesia, yet limited public understanding of it often creates communication barriers. Previous sign language recognition studies have generally been conducted offline, lacked real-time web integration, and produced only text output without multimodal interaction. To address these limitations, this study proposes a real-time, web-based BISINDO translator that uses a hybrid Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) model integrated with a Text-to-Speech (TTS) feature. The novelty of this research lies in combining a CNN for spatial feature extraction with an LSTM for temporal sequence learning within a fully deployed web application (NzSignify), enabling real-time, end-to-end sign language translation with both text and voice output. The dataset consists of primary video recordings from three subjects, covering 11 gesture classes with 1,000 grayscale frames per class at a resolution of 100×89 pixels. The proposed model is implemented in a React.js and Node.js-based system to support real-time inference. Experimental results show that the hybrid CNN-LSTM model achieves a classification accuracy of 96%, as evaluated with a confusion matrix. In real-time testing, an 80% confidence threshold effectively filters out misclassified gestures and improves the reliability of the text and speech output. Compared with previous studies that rely mainly on standalone CNNs or traditional machine learning methods with offline processing, the proposed approach better captures both the spatial and temporal features of sign gestures and supports real-time deployment. These findings indicate that the developed system provides a more practical, accurate, and interactive solution for BISINDO translation, improving communication accessibility between deaf and hearing communities through a real-time multimodal platform.
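
The confidence-threshold step described above can be illustrated with a minimal sketch. It assumes the classifier returns one probability per gesture class and that a prediction is forwarded to the text and speech outputs only when its top probability reaches 0.80. The function name `accept_prediction`, the example labels, and the choice of an inclusive comparison (`>=`) are hypothetical and are not taken from the NzSignify implementation.

```python
from typing import Optional, Sequence

# The 80% threshold is the value reported in this study; everything else
# in this sketch (names, labels, inclusive comparison) is an assumption.
CONFIDENCE_THRESHOLD = 0.80


def accept_prediction(
    probabilities: Sequence[float],
    labels: Sequence[str],
    threshold: float = CONFIDENCE_THRESHOLD,
) -> Optional[str]:
    """Return the top-scoring label, or None if it falls below the threshold."""
    if not probabilities or len(probabilities) != len(labels):
        raise ValueError("probabilities and labels must be non-empty and equal in length")

    # Index of the class with the highest predicted probability.
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])

    # Low-confidence predictions are dropped rather than translated.
    if probabilities[best] >= threshold:
        return labels[best]
    return None


if __name__ == "__main__":
    example_labels = ["halo", "terima kasih", "maaf"]  # hypothetical gesture classes
    print(accept_prediction([0.05, 0.90, 0.05], example_labels))
    print(accept_prediction([0.40, 0.35, 0.25], example_labels))
```

Under these assumptions, a rejected prediction yields `None`, so the caller can simply skip the text update and the speech synthesis for that frame window instead of emitting a likely misclassification.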