Understanding the direction of students’ visual attention is essential for evaluating their engagement during classroom learning. Head Pose Estimation (HPE) is an effective method for identifying attention focus; however, its application in real classroom settings is often hindered by low image quality and varied student seating positions, which make regression-based methods for predicting facial landmarks or Euler angles suboptimal. This study instead adopts an image-based classification approach and proposes a modification of the EfficientNetV2-S architecture that integrates Seat Position Embedding (SPE) as spatial context to improve head pose classification accuracy. The dataset was built from direct classroom recordings and processed into 4,574 head pose images with five directional labels (up, down, front, right, left). Several CNN architectures were evaluated with and without SPE. The results show that the proposed model with SPE achieved an accuracy of 83.25%, surpassing the baseline model’s 82.53%. This approach proved effective in reducing visual ambiguity and providing a more accurate interpretation of students’ attention.
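To make the SPE idea concrete, the sketch below shows one plausible way to fuse a learned seat-position embedding with EfficientNetV2-S features before the classification head. This is a minimal illustration, not the authors' implementation: the number of seat positions, the embedding width, and fusion by concatenation are all assumptions; only the backbone (torchvision's `efficientnet_v2_s`) and the five head-pose classes come from the abstract.

```python
# Minimal sketch (assumed design, not the paper's code): EfficientNetV2-S
# image features concatenated with a learned Seat Position Embedding (SPE),
# then classified into five head-pose directions.
import torch
import torch.nn as nn
from torchvision.models import efficientnet_v2_s

NUM_CLASSES = 5    # up, down, front, right, left (from the abstract)
NUM_SEATS = 30     # assumed number of distinct seat positions
SPE_DIM = 64       # assumed embedding width

class HeadPoseWithSPE(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = efficientnet_v2_s(weights="IMAGENET1K_V1")
        feat_dim = backbone.classifier[1].in_features  # 1280 for V2-S
        backbone.classifier = nn.Identity()            # keep pooled features only
        self.backbone = backbone
        self.seat_embedding = nn.Embedding(NUM_SEATS, SPE_DIM)
        self.classifier = nn.Sequential(
            nn.Dropout(0.2),
            nn.Linear(feat_dim + SPE_DIM, NUM_CLASSES),
        )

    def forward(self, image, seat_id):
        visual = self.backbone(image)           # (B, 1280) pooled image features
        spatial = self.seat_embedding(seat_id)  # (B, SPE_DIM) seat-position context
        return self.classifier(torch.cat([visual, spatial], dim=1))

# Usage example: a batch of 4 face crops with their seat indices.
model = HeadPoseWithSPE()
logits = model(torch.randn(4, 3, 224, 224), torch.tensor([0, 5, 12, 29]))
print(logits.shape)  # torch.Size([4, 5])
```

Concatenation is only one way to inject the spatial context; the paper may fuse the embedding differently (e.g., additively or at an earlier stage), so treat this as a starting point for reproduction rather than a specification.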