A parallel convolutional neural network–long short-term memory (CNN–LSTM) architecture is introduced for voice command recognition, designed to extract spatial and temporal features from speech signals simultaneously. Conventional serial architectures process these components sequentially, which can cause temporal information to be lost after CNN-based spatial compression. This study aimed to improve recognition performance by preserving complementary spectral and temporal representations through parallel feature modeling. In the proposed approach, the CNN branch extracted spectral features from Mel-frequency cepstral coefficients (MFCCs), while the LSTM branch independently modeled long-term temporal dependencies from the same input stream. The outputs of both branches were fused through concatenation into a comprehensive acoustic representation that enhances discrimination between phonetically similar commands. The model was trained and evaluated on a dataset containing eight classes of spoken commands. During training, the proposed model achieved a loss of 0.0186 and an accuracy of 99.87%, indicating effective learning. On the validation and test datasets, the model reached an accuracy of 89.16%, demonstrating stable convergence and consistent generalization performance. Evaluation with precision, recall, and F1-score metrics confirmed balanced recognition across classes, with particularly high accuracy for commands such as “stop,” “right,” and “yes,” while “go” and “no” showed lower accuracy owing to their acoustic similarity. In conclusion, the proposed parallel CNN–LSTM architecture effectively integrates convolutional and recurrent learning, yielding improved recognition accuracy and robust performance with strong potential for real-time voice control and embedded applications.
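The parallel branch-and-fuse idea described above can be illustrated with a minimal NumPy sketch: a small 1-D convolutional branch and a single LSTM cell both consume the same MFCC matrix, their outputs are concatenated, and a softmax layer scores the eight command classes. All layer sizes, weight shapes, and function names here are illustrative assumptions for exposition, not the authors' actual implementation or hyperparameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cnn_branch(mfcc, kernels):
    """1-D convolution over time + ReLU + global average pooling.
    mfcc: (T, F) frames x coefficients; kernels: (C, K, F) filters."""
    T, F = mfcc.shape
    C, K, _ = kernels.shape
    feats = np.empty(C)
    for c in range(C):
        conv = np.array([np.sum(mfcc[t:t + K] * kernels[c])
                         for t in range(T - K + 1)])
        feats[c] = np.maximum(conv, 0.0).mean()  # ReLU, then pool over time
    return feats

def lstm_branch(mfcc, Wx, Wh, b):
    """Single LSTM layer over the frames; returns the final hidden state.
    Wx: (4H, F), Wh: (4H, H), b: (4H,) -- gates stacked as [i, f, g, o]."""
    H = Wh.shape[1]
    h = np.zeros(H)
    c = np.zeros(H)
    for x in mfcc:
        z = Wx @ x + Wh @ h + b
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h

def parallel_cnn_lstm(mfcc, kernels, Wx, Wh, b, Wout, bout):
    """Run both branches on the same MFCC input, concatenate, classify."""
    fused = np.concatenate([cnn_branch(mfcc, kernels),
                            lstm_branch(mfcc, Wx, Wh, b)])
    logits = Wout @ fused + bout
    e = np.exp(logits - logits.max())
    return e / e.sum()  # softmax over the eight command classes

# Tiny demo with random (untrained) weights; sizes are assumptions.
rng = np.random.default_rng(0)
T, F, C, K, H, N_CLASSES = 98, 13, 16, 5, 32, 8
mfcc = rng.standard_normal((T, F))
probs = parallel_cnn_lstm(
    mfcc,
    rng.standard_normal((C, K, F)) * 0.1,        # CNN filters
    rng.standard_normal((4 * H, F)) * 0.1,       # LSTM input weights
    rng.standard_normal((4 * H, H)) * 0.1,       # LSTM recurrent weights
    np.zeros(4 * H),                             # LSTM bias
    rng.standard_normal((N_CLASSES, C + H)) * 0.1,  # classifier weights
    np.zeros(N_CLASSES),                         # classifier bias
)
print(probs.shape)
```

Concatenation keeps the two feature views intact, so the classifier sees both the CNN's pooled spectral features and the LSTM's final temporal state, rather than a temporally compressed CNN output fed serially into the LSTM.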
Copyright © 2026