A parallel convolutional neural network–long short-term memory (CNN–LSTM) architecture is introduced for voice command recognition, designed to extract spatial and temporal features from speech signals simultaneously. Conventional serial architectures process these components sequentially, which can cause temporal information to be lost after CNN-based spatial compression. This study aimed to improve recognition performance by preserving complementary spectral and temporal representations through parallel feature modeling. In the proposed approach, the CNN branch extracted spectral features from Mel-frequency cepstral coefficients (MFCCs), while the LSTM branch independently modeled long-term temporal dependencies from the same input stream. The outputs of both branches were fused through concatenation into a comprehensive acoustic representation that enhances discrimination between phonetically similar commands. The model was trained and evaluated on a dataset containing eight classes of spoken commands. During training, the proposed model achieved a loss of 0.0186 and an accuracy of 99.87%, indicating effective learning. On the validation and test datasets, the model reached an accuracy of 89.16%, demonstrating stable convergence and consistent generalization performance. Evaluation with precision, recall, and F1-score metrics confirmed balanced recognition across classes, with particularly high accuracy for commands such as “stop,” “right,” and “yes,” while “go” and “no” showed lower accuracy owing to their acoustic similarity. In conclusion, the proposed parallel CNN–LSTM architecture effectively integrates convolutional and recurrent learning, yielding improved recognition accuracy and robust performance with strong potential for real-time voice control and embedded applications.
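The parallel branch-and-fuse idea described above can be illustrated with a minimal NumPy sketch: a small 1-D convolutional branch and a single LSTM cell both consume the same MFCC matrix, their outputs are concatenated, and a softmax layer scores the eight command classes. All layer sizes, weight shapes, and function names here are illustrative assumptions for exposition, not the authors' actual implementation or hyperparameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cnn_branch(mfcc, kernels):
    """1-D convolution over time + ReLU + global average pooling.
    mfcc: (T, F) frames x coefficients; kernels: (C, K, F) filters."""
    T, F = mfcc.shape
    C, K, _ = kernels.shape
    feats = np.empty(C)
    for c in range(C):
        conv = np.array([np.sum(mfcc[t:t + K] * kernels[c])
                         for t in range(T - K + 1)])
        feats[c] = np.maximum(conv, 0.0).mean()  # ReLU, then pool over time
    return feats

def lstm_branch(mfcc, Wx, Wh, b):
    """Single LSTM layer over the frames; returns the final hidden state.
    Wx: (4H, F), Wh: (4H, H), b: (4H,) -- gates stacked as [i, f, g, o]."""
    H = Wh.shape[1]
    h = np.zeros(H)
    c = np.zeros(H)
    for x in mfcc:
        z = Wx @ x + Wh @ h + b
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h

def parallel_cnn_lstm(mfcc, kernels, Wx, Wh, b, Wout, bout):
    """Run both branches on the same MFCC input, concatenate, classify."""
    fused = np.concatenate([cnn_branch(mfcc, kernels),
                            lstm_branch(mfcc, Wx, Wh, b)])
    logits = Wout @ fused + bout
    e = np.exp(logits - logits.max())
    return e / e.sum()  # softmax over the eight command classes

# Tiny demo with random (untrained) weights; sizes are assumptions.
rng = np.random.default_rng(0)
T, F, C, K, H, N_CLASSES = 98, 13, 16, 5, 32, 8
mfcc = rng.standard_normal((T, F))
probs = parallel_cnn_lstm(
    mfcc,
    rng.standard_normal((C, K, F)) * 0.1,        # CNN filters
    rng.standard_normal((4 * H, F)) * 0.1,       # LSTM input weights
    rng.standard_normal((4 * H, H)) * 0.1,       # LSTM recurrent weights
    np.zeros(4 * H),                             # LSTM bias
    rng.standard_normal((N_CLASSES, C + H)) * 0.1,  # classifier weights
    np.zeros(N_CLASSES),                         # classifier bias
)
print(probs.shape)
```

Concatenation keeps the two feature views intact, so the classifier sees both the CNN's pooled spectral features and the LSTM's final temporal state, rather than a temporally compressed CNN output fed serially into the LSTM.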
Copyright © 2026