Buletin Ilmiah Sarjana Teknik Elektro
Vol. 7 No. 4 (2025): December

Bi-LSTM and Attention-based Approach for Lip-To-Speech Synthesis in Low-Resource Languages: A Case Study on Bahasa Indonesia

Setyaningsih, Eka Rahayu
Handayani, Anik Nur
Irianto, Wahyu Sakti Gunawan
Kristian, Yosi



Article Info

Publish Date
23 Oct 2025

Abstract

Lip-to-speech synthesis transforms visual information, particularly lip movements, into intelligible speech. The technology has attracted growing attention for its potential in assistive communication for individuals with speech impairments, audio restoration when speech signals are missing or corrupted, and improved communication quality in noisy or bandwidth-limited environments. However, research on low-resource languages such as Bahasa Indonesia remains limited, primarily because suitable corpora are lacking and the language has its own distinct phonetic structure. To address this challenge, this study employs the LUMINA dataset, a purpose-built Indonesian audio-visual corpus comprising 14 speakers with diverse syllabic coverage. The main contribution of this work is the design and evaluation of an Attention-Augmented Bi-LSTM Multimodal Autoencoder, implemented as a two-stage parallel pipeline: (1) an audio autoencoder trained to learn compact latent representations from Mel-spectrograms, and (2) a visual encoder based on EfficientNetV2-S, integrated with Bi-LSTM and multi-head attention, that predicts these latent features from silent video sequences. The experimental results are promising but constrained. The objective metrics reach maximum scores of PESQ 1.465, STOI 0.7445, and ESTOI 0.5099, considerably lower than those of state-of-the-art English systems (PESQ > 2.5, STOI > 0.85), which indicates that intelligibility remains a challenge. Subjective evaluation using Mean Opinion Score (MOS), however, shows consistent improvements: baseline LSTM models achieve a MOS of only 1.7–2.5, whereas the Bi-LSTM with 8-head attention attains 3.3–4.0, with the highest ratings observed in female multi-speaker scenarios. These findings confirm that Bi-LSTM with attention improves over conventional baselines and generalizes better in multi-speaker contexts.
The study establishes a first baseline for lip-to-speech synthesis in Bahasa Indonesia and underscores the importance of larger datasets and advanced modeling strategies to further enhance intelligibility and robustness in low-resource language settings.
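The second stage described in the abstract (per-frame visual features passed through a Bi-LSTM and 8-head attention to predict audio latent vectors) can be illustrated with a minimal NumPy forward pass. This is only a sketch under stated assumptions, not the authors' implementation: the EfficientNetV2-S features and the audio autoencoder latents are replaced by random stand-ins, the weights are untrained, and every dimension and name (`T`, `FEAT`, `HID`, `LATENT`, `lstm_pass`, and so on) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def init_lstm(in_dim, hid):
    # One weight matrix covering the four gates (input, forget, cell, output).
    return {"W": rng.normal(0, 0.1, (in_dim + hid, 4 * hid)),
            "b": np.zeros(4 * hid)}

def lstm_pass(x, p, reverse=False):
    """Run one LSTM direction over x of shape (T, in_dim); returns (T, hid)."""
    hid = p["b"].size // 4
    h, c = np.zeros(hid), np.zeros(hid)
    out = np.zeros((x.shape[0], hid))
    steps = range(x.shape[0] - 1, -1, -1) if reverse else range(x.shape[0])
    for t in steps:
        z = np.concatenate([x[t], h]) @ p["W"] + p["b"]
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        out[t] = h
    return out

def bilstm(x, fwd, bwd):
    # Concatenate forward and backward hidden states per frame: (T, 2 * hid).
    return np.concatenate(
        [lstm_pass(x, fwd), lstm_pass(x, bwd, reverse=True)], axis=1)

def init_attention(d_model):
    return {k: rng.normal(0, 0.1, (d_model, d_model))
            for k in ("Wq", "Wk", "Wv", "Wo")}

def multi_head_self_attention(x, p, num_heads):
    """x: (T, d_model). Returns output (T, d_model) and weights (heads, T, T)."""
    T, d = x.shape
    dh = d // num_heads

    def split(m):  # (T, d) -> (heads, T, dh)
        return m.reshape(T, num_heads, dh).transpose(1, 0, 2)

    q, k, v = split(x @ p["Wq"]), split(x @ p["Wk"]), split(x @ p["Wv"])
    w = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))
    ctx = (w @ v).transpose(1, 0, 2).reshape(T, d)
    return ctx @ p["Wo"], w

# Hypothetical sizes: 25 video frames, 64-d frame features, 16-d audio latents.
T, FEAT, HID, HEADS, LATENT = 25, 64, 32, 8, 16

frames = rng.normal(size=(T, FEAT))          # stand-in for CNN frame features
target_latent = rng.normal(size=(T, LATENT))  # stand-in for audio latents

fwd, bwd = init_lstm(FEAT, HID), init_lstm(FEAT, HID)
attn = init_attention(2 * HID)
W_out = rng.normal(0, 0.1, (2 * HID, LATENT))

seq = bilstm(frames, fwd, bwd)                                # (T, 2 * HID)
ctx, weights = multi_head_self_attention(seq, attn, HEADS)    # (T, 2 * HID)
pred_latent = ctx @ W_out                                     # (T, LATENT)

# A mean-squared error against the audio latents is one plausible training
# objective for this stage; the abstract does not state the loss used.
mse = float(np.mean((pred_latent - target_latent) ** 2))
print(seq.shape, pred_latent.shape, weights.shape)
```

In a full system, the predicted latents would be passed to the decoder of the audio autoencoder to reconstruct a Mel-spectrogram, which is why the visual branch only needs to match the latent dimension rather than the spectrogram itself.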

Copyrights © 2025






Journal Info

Abbrev

biste

Publisher

Subject

Electrical & Electronics Engineering

Description

Buletin Ilmiah Sarjana Teknik Elektro (BISTE) is an open-access national journal managed by the Electrical Engineering Study Program, Faculty of Industrial Technology, Universitas Ahmad Dahlan. BISTE is a journal intended for undergraduate Electrical Engineering students. Scope ...