This research develops a web-based Sundanese speech translation system that incorporates visual enhancement through a Convolutional Neural Network (CNN). The primary challenge is the insufficient accuracy of audio-only Automatic Speech Recognition (ASR) for low-resource languages under noisy conditions. The proposed solution integrates a fine-tuned Whisper Medium model for transcription, CNN-based lip reading, and attention-weighted audio-visual fusion. Training used the OpenSLR36 Sundanese corpus, with approximately 35,000 samples drawn from the 175,324 available instances (a subset imposed by memory constraints). Fine-tuning ran on RunPod with an NVIDIA RTX 4090 GPU (24 GB VRAM) for 5,000 iterations (about 11 hours). The optimized model achieves a Word Error Rate (WER) of 2.45% at its best checkpoint (iteration 3,500), an improvement of 7.37 percentage points over the baseline (9.82% at iteration 500). This performance approaches the state of the art reported by Raharjo & Zahra (2025), who achieved 2.03% WER using Whisper Small. The visual module comprises a three-layer CNN that produces 512-dimensional features from facial regions detected with MediaPipe. Black-box testing validates functional compliance, and a responsive interface ensures cross-device compatibility. This work advances Sundanese language preservation through an accessible translation system with competitive accuracy.
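The core architectural claims (a three-layer CNN yielding 512-dimensional visual features, and attention-weighted audio-visual fusion) can be illustrated with a minimal sketch. Everything below other than the 512-dimensional visual feature size is an assumption for illustration: the input crop size, channel widths, the 1024-dimensional audio features (Whisper Medium's encoder width), and the softmax gating scheme are not taken from the authors' implementation.

```python
# Minimal sketch (PyTorch) of the components named in the abstract: a
# three-layer CNN mapping a lip-region crop to a 512-d feature, and an
# attention-weighted fusion of audio and visual features. Layer sizes other
# than the 512-d output, and the softmax gating, are illustrative assumptions.
import torch
import torch.nn as nn


class LipCNN(nn.Module):
    """Three-layer CNN: 96x96 grayscale lip crop -> 512-d visual feature."""

    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),   # 96 -> 48
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),  # 48 -> 24
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(), # 24 -> 12
            nn.AdaptiveAvgPool2d(1),                               # -> (B, 128, 1, 1)
        )
        self.fc = nn.Linear(128, feat_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(self.conv(x).flatten(1))


class AttentionFusion(nn.Module):
    """Softmax-gated convex combination of audio and visual features."""

    def __init__(self, audio_dim: int = 1024, visual_dim: int = 512, d: int = 512):
        super().__init__()
        self.proj_a = nn.Linear(audio_dim, d)  # Whisper Medium states are 1024-d
        self.proj_v = nn.Linear(visual_dim, d)
        self.score = nn.Linear(d, 1)           # one scalar score per modality

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        a, v = self.proj_a(audio), self.proj_v(visual)         # (B, d) each
        scores = torch.cat([self.score(a), self.score(v)], 1)  # (B, 2)
        w = scores.softmax(dim=1)                              # attention weights
        return w[:, :1] * a + w[:, 1:] * v                     # fused (B, d)


if __name__ == "__main__":
    visual = LipCNN()(torch.randn(4, 1, 96, 96))           # batch of 4 lip crops
    fused = AttentionFusion()(torch.randn(4, 1024), visual)
    print(fused.shape)  # torch.Size([4, 512])
```

In noisy conditions such a gate can learn to shift weight toward the visual branch, which is the stated motivation for augmenting audio-only ASR with lip reading.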