In large classroom environments, teachers often struggle to monitor each student's facial expressions throughout the learning process. Yet facial expressions are important indicators of students' emotional states and engagement, and detecting them in real time can support a more adaptive learning experience. Most previous research on Facial Expression Recognition (FER) has relied on Convolutional Neural Networks (CNNs), which tend to be limited in capturing global relationships among facial features. Moreover, many studies focus on model accuracy without evaluating practical effectiveness in real classroom settings. This study aims to develop a facial expression recognition model that is both accurate and efficient enough for classroom use. A hybrid Vision Transformer (ViT) architecture is proposed, combining MobileNetV3 for local feature extraction with a ViT for global context modeling. To reduce the number of tokens and the computational cost, a Token Downsampling method is introduced within the transformer blocks. Trained on the FER2013 dataset, the model achieves a test accuracy of 71.24%, surpassing the 70.10% of a baseline pretrained ViT, and the Token Downsampling method also improves inference speed. The model is further tested on a custom dataset collected from students in a real classroom to evaluate its performance under practical conditions. Although results on the classroom dataset are not yet optimal, the performance on FER2013 demonstrates the potential of this approach for further development toward real-time, accurate facial expression recognition in educational environments.
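Although the abstract does not give implementation details, the architecture it describes can be sketched concretely. The minimal PyTorch sketch below wires torchvision's MobileNetV3-Small into a small transformer encoder and stands in for Token Downsampling with 2x2 average pooling over the token grid. Every layer size, the depth split, and the pooling-based downsampling rule are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import mobilenet_v3_small


class TokenDownsample(nn.Module):
    """Shrink the token grid with 2x2 average pooling -- one plausible
    reading of 'Token Downsampling'; the paper's exact rule may differ."""

    def forward(self, tokens, grid):           # tokens: (B, N, D), grid: (H, W)
        b, _, d = tokens.shape
        h, w = grid
        fmap = tokens.transpose(1, 2).reshape(b, d, h, w)  # back to a 2-D map
        fmap = F.avg_pool2d(fmap, kernel_size=2)           # e.g. 7x7 -> 3x3
        h, w = fmap.shape[2], fmap.shape[3]
        return fmap.flatten(2).transpose(1, 2), (h, w)     # fewer tokens


class HybridViT(nn.Module):
    """MobileNetV3 stem for local features + transformer blocks for global
    context, with token downsampling between the two block groups."""

    def __init__(self, num_classes=7, dim=256, heads=4, depth=4):
        super().__init__()
        # torchvision's MobileNetV3-Small backbone: a 224x224 input yields
        # a 576-channel 7x7 feature map.
        self.backbone = mobilenet_v3_small(weights=None).features
        self.proj = nn.Conv2d(576, dim, kernel_size=1)        # map -> token dim
        self.pos = nn.Parameter(torch.zeros(1, 7 * 7, dim))   # fixed for 224 input
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.pre = nn.TransformerEncoder(layer, num_layers=depth // 2)
        self.down = TokenDownsample()           # cheaper attention afterwards
        self.post = nn.TransformerEncoder(layer, num_layers=depth - depth // 2)
        self.head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, num_classes))

    def forward(self, x):                       # x: (B, 3, 224, 224)
        fmap = self.proj(self.backbone(x))      # (B, dim, 7, 7)
        b, d, h, w = fmap.shape
        tokens = fmap.flatten(2).transpose(1, 2) + self.pos  # (B, 49, dim)
        tokens = self.pre(tokens)               # global attention over all 49 tokens
        tokens, _ = self.down(tokens, (h, w))   # (B, 9, dim) after 2x2 pooling
        tokens = self.post(tokens)              # attention over the reduced set
        return self.head(tokens.mean(dim=1))    # mean-pool -> 7 FER2013 classes


model = HybridViT()
logits = model(torch.randn(2, 3, 224, 224))     # -> torch.Size([2, 7])
```

Since self-attention scales roughly quadratically in the number of tokens, pooling the grid mid-network sharply reduces the cost of the remaining attention layers, which is the kind of saving the abstract attributes to Token Downsampling. Note that FER2013 images are 48x48 grayscale, so in practice they would presumably be resized and channel-replicated to match the 224x224 RGB input assumed here.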