Facial Expression Recognition of Students in Classroom Using Hybrid MobileNetV3-Vision Transformer with Token Downsampling
Khaairi, Mochamad; Rasim, Rasim; Wihardi, Yaya
Brilliance: Research of Artificial Intelligence, Vol. 5 No. 1 (2025), Research Article, May 2025
Publisher: Yayasan Cita Cendekiawan Al Khwarizmi

DOI: 10.47709/brilliance.v5i1.6323

Abstract

In large classrooms, teachers often struggle to monitor every student’s facial expression throughout a lesson. Yet facial expressions are important indicators of students’ emotional states and engagement and, when detected in real time, can support a more adaptive learning experience. Most previous research on Facial Expression Recognition (FER) has relied on Convolutional Neural Networks (CNNs), which tend to be limited in capturing global relationships between facial features. In addition, many studies report model accuracy without evaluating practical effectiveness in real classroom settings. This study aims to develop an FER model that is both accurate and efficient enough for classroom use. It proposes a hybrid architecture that combines MobileNetV3 for local feature extraction with a Vision Transformer (ViT) for global context modeling. To reduce the number of tokens and the computational cost, a Token Downsampling method is introduced within the transformer blocks. Trained on the FER2013 dataset, the model achieves a test accuracy of 71.24%, surpassing the baseline pretrained ViT model, which reached 70.10%. The Token Downsampling method also improves inference speed. The model is further tested on a custom dataset collected from students in a real classroom to evaluate its performance in practice. Although performance on the classroom dataset is not yet optimal, the FER2013 results demonstrate the potential of this approach for further development toward real-time, accurate facial expression recognition in educational environments.
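
To illustrate why reducing the token count lowers the cost of the transformer blocks, the following is a minimal sketch rather than the authors' implementation. The abstract does not say how Token Downsampling works, so the 2x2 average pooling over a token grid, the 14x14 grid size, and the names `downsample_tokens` and `attention_pairs` are all assumptions made for illustration.

```python
from typing import List

Token = List[float]


def downsample_tokens(tokens: List[Token], grid_h: int, grid_w: int,
                      stride: int = 2) -> List[Token]:
    """Average-pool a row-major grid of tokens with a stride x stride window.

    Assumption: this pooling scheme is illustrative only; the abstract does
    not specify how Token Downsampling is actually implemented.
    """
    if len(tokens) != grid_h * grid_w:
        raise ValueError("token count must equal grid_h * grid_w")
    if grid_h % stride or grid_w % stride:
        raise ValueError("grid dimensions must be divisible by stride")
    dim = len(tokens[0])
    pooled: List[Token] = []
    for top in range(0, grid_h, stride):
        for left in range(0, grid_w, stride):
            # Collect the tokens covered by the current pooling window.
            window = [tokens[(top + dy) * grid_w + (left + dx)]
                      for dy in range(stride) for dx in range(stride)]
            # Replace the window with the element-wise mean of its tokens.
            pooled.append([sum(t[d] for t in window) / len(window)
                           for d in range(dim)])
    return pooled


def attention_pairs(num_tokens: int) -> int:
    """Number of query-key pairs scored by one self-attention head."""
    return num_tokens * num_tokens


if __name__ == "__main__":
    grid_h = grid_w = 14  # hypothetical feature-map size from the CNN backbone
    tokens = [[float(i), 1.0] for i in range(grid_h * grid_w)]
    pooled = downsample_tokens(tokens, grid_h, grid_w, stride=2)
    print(len(tokens), "->", len(pooled))
    print(attention_pairs(len(tokens)) / attention_pairs(len(pooled)))
```

Under these assumptions, pooling a 14x14 grid with stride 2 reduces 196 tokens to 49, and because self-attention scores every token against every other token, the number of scored pairs drops by a factor of 16. The speedup actually reported in the paper depends on the authors' method and on the rest of the network, which this sketch does not model.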