This study developed and applied a multimodal Transformer model for screening student anxiety from short video interviews, drawing on facial expressions, speech, and demographic data. Student anxiety frequently harms mental health and academic performance, so early detection is important. The model fuses three data sources: facial-expression features, speech features (speaking rate, intonation, and negative-word count), and demographic information. The dataset comprised 500 students, each of whom completed an interview lasting 20-40 seconds. The multimodal Transformer was trained to classify anxiety as low, medium, or high and was evaluated with accuracy, precision, and recall. The model achieved a prediction accuracy of 88%, and both facial-expression features and negative-word counts correlated significantly with anxiety level. Compared with the linear regression model used as a baseline, the multimodal Transformer performed better at detecting anxiety. These findings indicate that a multimodal AI-based approach can improve the accuracy and efficiency of student anxiety screening. This research opens opportunities for developing a more objective, non-invasive, and efficient video-based automated screening system, with potential applications in mental health in higher education.
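The evaluation relies on accuracy, precision, and recall over the three anxiety levels. As a minimal sketch of how those metrics are computed for a three-class classifier (the labels below are hypothetical illustrations, not the study's data), one might write:

```python
def evaluate(y_true, y_pred, labels=("low", "medium", "high")):
    """Accuracy plus per-class precision and recall for a
    three-level anxiety classifier."""
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    per_class = {}
    for c in labels:
        # True positives, false positives, false negatives for class c
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        per_class[c] = {"precision": precision, "recall": recall}
    return accuracy, per_class

# Hypothetical labels for 8 interview clips (illustration only)
y_true = ["low", "low", "medium", "medium", "high", "high", "low", "medium"]
y_pred = ["low", "medium", "medium", "medium", "high", "low", "low", "medium"]
acc, stats = evaluate(y_true, y_pred)  # acc = 0.75 on this toy sample
```

Reporting precision and recall per class, rather than accuracy alone, matters here because anxiety levels are unlikely to be balanced across 500 students, and a screening tool's false-negative rate on the "high" class is the clinically critical quantity.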
Copyright © 2026