Santoso
Department of Electrical Engineering, Faculty of Intelligent Electrical and Informatics Technology, Institut Teknologi Sepuluh Nopember, Surabaya, Jawa Timur 60111, Indonesia

Articles


Voice Command Recognition Using CNN-LSTM Parallel Architecture
Santoso; Tri Arief Sardjono; Djoko Purwanto
Jurnal Nasional Teknik Elektro dan Teknologi Informasi Vol 15 No 1: Februari 2026
Publisher: Department of Electrical and Information Engineering, Faculty of Engineering, Universitas Gadjah Mada

DOI: 10.22146/jnteti.v15i1.23855

Abstract

A parallel convolutional neural network–long short-term memory (CNN–LSTM) architecture is introduced for voice command recognition, designed to simultaneously extract spatial and temporal features from speech signals. Conventional serial architectures process these components sequentially, which can lead to the loss of temporal information after CNN-based spatial compression. This study aimed to improve recognition performance by preserving complementary spectral and temporal representations through parallel feature modeling. In the proposed approach, the CNN branch extracted spectral features from Mel-frequency cepstral coefficients (MFCCs), while the LSTM branch independently modeled long-term temporal dependencies from the same input stream. The outputs from both branches were fused through concatenation to form a comprehensive acoustic representation that enhances discrimination between phonetically similar commands. The model was trained and evaluated using a dataset containing eight classes of spoken commands. During training, the proposed model achieved a loss of 0.0186 and an accuracy of 99.87%, indicating effective learning. On the validation and test datasets, the model reached an accuracy of 89.16%, demonstrating stable convergence and consistent generalization performance. Evaluation using precision, recall, and F1 score metrics confirmed balanced recognition across classes, with particularly high accuracy for commands such as “stop,” “right,” and “yes,” while “go” and “no” showed lower accuracy because of their acoustic similarity. In conclusion, the proposed parallel CNN–LSTM architecture effectively integrates convolutional and recurrent learning, resulting in improved recognition accuracy and robust performance, with strong potential for real-time voice control and embedded applications.
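
The parallel layout described in the abstract can be illustrated with a small NumPy forward pass. This is only a sketch under stated assumptions, not the authors' implementation: the abstract gives the eight output classes and the fusion by concatenation, but the number of MFCC coefficients, the frame count, the filter and hidden sizes, the use of a 1-D convolution with global max pooling, and the choice of the last LSTM hidden state as the temporal summary are all hypothetical. The weights are random and untrained, so the output probabilities carry no meaning beyond showing the data flow.

```python
import numpy as np

# Only N_CLASSES = 8 comes from the abstract; every other size is an assumption.
N_MFCC, N_FRAMES, N_CLASSES = 13, 50, 8
N_FILTERS, KERNEL, HIDDEN = 16, 3, 32


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()


def cnn_branch(mfcc, kernels, bias):
    """1-D convolution along time, ReLU, global max pooling -> (N_FILTERS,)."""
    k = kernels.shape[1]
    # Sliding windows over time: (n_windows, k, N_MFCC)
    windows = np.stack([mfcc[t:t + k] for t in range(mfcc.shape[0] - k + 1)])
    feature_maps = np.einsum("tkm,fkm->tf", windows, kernels) + bias
    return np.maximum(feature_maps, 0.0).max(axis=0)


def lstm_branch(mfcc, W, U, b):
    """Single-layer LSTM over the same frames; returns the last hidden state."""
    hidden = U.shape[1]
    h, c = np.zeros(hidden), np.zeros(hidden)
    for x_t in mfcc:
        gates = W @ x_t + U @ h + b       # stacked gate pre-activations, (4 * hidden,)
        i, f, g, o = np.split(gates, 4)   # input, forget, candidate, output
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h


def parallel_cnn_lstm(mfcc, p):
    """Both branches read the same MFCC input; their outputs are concatenated."""
    spectral = cnn_branch(mfcc, p["kernels"], p["conv_bias"])
    temporal = lstm_branch(mfcc, p["W"], p["U"], p["b"])
    fused = np.concatenate([spectral, temporal])  # fusion step from the abstract
    return softmax(p["W_out"] @ fused + p["b_out"])


def init_params(rng, scale=0.1):
    """Random, untrained weights, used only to exercise the forward pass."""
    return {
        "kernels": rng.normal(0, scale, (N_FILTERS, KERNEL, N_MFCC)),
        "conv_bias": np.zeros(N_FILTERS),
        "W": rng.normal(0, scale, (4 * HIDDEN, N_MFCC)),
        "U": rng.normal(0, scale, (4 * HIDDEN, HIDDEN)),
        "b": np.zeros(4 * HIDDEN),
        "W_out": rng.normal(0, scale, (N_CLASSES, N_FILTERS + HIDDEN)),
        "b_out": np.zeros(N_CLASSES),
    }


rng = np.random.default_rng(0)
params = init_params(rng)
mfcc = rng.normal(size=(N_FRAMES, N_MFCC))  # stand-in for real MFCC features
probs = parallel_cnn_lstm(mfcc, params)
print(probs.shape)
```

In a serial CNN-LSTM, the LSTM would instead consume the pooled or strided CNN output. Here the LSTM sees every original frame, which is how the abstract describes the parallel design as preserving temporal information.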