Automatic Speech Recognition (ASR) for Bahasa Indonesia faced challenges in accuracy and noise robustness. This research addressed the limitations of single feature extraction methods: Mel-Frequency Cepstral Coefficients (MFCC), which were sensitive to noise, and Relative Spectral Transform-Perceptual Linear Predictive (RASTA-PLP) analysis, which was less effective in frequency representation. It proposed a hybrid approach that combined both techniques with Long Short-Term Memory (LSTM) models. MFCC contributed spectral accuracy, while RASTA-PLP contributed noise robustness, resulting in a more adaptive and informative acoustic representation. The evaluation showed that the hybrid method outperformed both the single-feature approaches and the approach without feature extraction, achieving a Character Error Rate (CER) of 0.5245 on clean data and 0.8811 on noisy data, and a Word Error Rate (WER) of 0.9229 on clean data and 1.0015 on noisy data. Although the hybrid approach required longer training times and higher memory usage, it remained stable and effective in reducing transcription errors. These findings suggested that, among the approaches evaluated, the hybrid method was the most suitable for Indonesian speech recognition across varied acoustic conditions.
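The CER and WER figures reported above are conventionally defined as the edit distance between the reference and the recognized transcript, divided by the reference length. The following is a minimal sketch of that standard definition, not the evaluation code used in this research; the function names `edit_distance`, `wer`, and `cer` and the example sentences are illustrative assumptions. It also shows why a WER above 1, such as the 1.0015 on noisy data, is possible: inserted words count as errors, so the error count can exceed the number of reference words.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (hypothetical helper)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


if __name__ == "__main__":
    # Illustrative sentences only; not taken from the study's dataset.
    print(wer("saya makan nasi", "saya minum nasi"))  # one substitution in three words
    print(wer("selamat pagi", "se la mat pa gi"))     # insertions push WER above 1
    print(cer("pagi", "padi"))                        # one substitution in four characters
```

Whitespace and casing are compared as-is in this sketch; real evaluations often normalize the text first, which this example does not assume.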
Copyrights © 2025