Although facial expression recognition (FER) using deep learning has received increasing attention in prior studies, research specifically comparing the effectiveness of sequential modeling on static image data remains limited. This study evaluates and compares the performance of a pure Convolutional Neural Network (CNN) and a hybrid CNN–Long Short-Term Memory (CNN-LSTM) model in classifying seven basic facial expressions on the static FER2013 dataset. A quantitative experimental approach with a comparative study design was employed, using the publicly available FER2013 dataset and two custom deep learning architectures, with model performance evaluated using accuracy, precision, recall, F1-score, and AUC-ROC. The findings indicate that the pure CNN significantly outperformed the CNN-LSTM model, achieving a test accuracy of 63.25% versus 46.82% for the hybrid model; the CNN discriminated well among visually distinct classes but continued to struggle with visually similar expressions. These results contribute to the theoretical basis for deep learning architecture selection and deepen understanding of how sequence models behave on static data. The study concludes that data characteristics (static versus temporal) play a crucial role in determining model effectiveness and that, for static datasets such as FER2013, a pure CNN is the more appropriate choice. The implications of this research include theoretical contributions to the growing literature on deep learning-based FER and a practical recommendation that developers prioritize CNN architectures for non-temporal image classification tasks, while also highlighting opportunities for future research on transfer learning and attention mechanisms to better capture subtle expression nuances.
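To make the architectural contrast concrete, the sketch below shows one plausible way the two model families could be defined for 48×48 grayscale FER2013 images with seven classes. The layer counts, filter sizes, and the row-wise reshaping that feeds the LSTM are illustrative assumptions, not the exact architectures evaluated in this study.

```python
# Illustrative sketch only: layer sizes and the feature-map-to-sequence reshaping
# are assumptions for exposition, not the paper's reported architectures.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 7            # seven basic facial expressions
INPUT_SHAPE = (48, 48, 1)  # FER2013: 48x48 grayscale images


def build_cnn() -> tf.keras.Model:
    """Pure CNN classifier for static images."""
    return models.Sequential([
        layers.Input(shape=INPUT_SHAPE),
        layers.Conv2D(32, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])


def build_cnn_lstm() -> tf.keras.Model:
    """Hybrid model: CNN features whose spatial rows are read as a pseudo-sequence by an LSTM."""
    inputs = layers.Input(shape=INPUT_SHAPE)
    x = layers.Conv2D(32, 3, activation="relu", padding="same")(inputs)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, 3, activation="relu", padding="same")(x)
    x = layers.MaxPooling2D()(x)           # feature map shape: (12, 12, 64)
    x = layers.Reshape((12, 12 * 64))(x)   # each feature-map row becomes one "time step"
    x = layers.LSTM(128)(x)                # sequence modeling over an artificial axis of a static image
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return models.Model(inputs, outputs)


# Both models are trained and evaluated the same way, so any performance gap
# reflects the architectural choice rather than the training setup.
model = build_cnn()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```

The key point the sketch illustrates is that, on static images, the LSTM has no genuine temporal axis to exploit: the "sequence" is manufactured from spatial rows of a single frame, which is consistent with the pure CNN being the stronger choice for FER2013.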