Monitoring student behavior during classroom learning is important for supporting learning quality and teacher performance. This study presents a pilot comparison between YOLOv8 and YOLOv11 for detecting student classroom behaviors from CCTV images. Six elementary behaviors are consistently defined and used throughout the work: lookup, raise-hand, read, stand, turn-head, and write. Although the available SCB dataset contains 4,934 labeled images, this study deliberately uses a front-facing subset of 100 images that most clearly show posture and behavior. After augmentation, the dataset grows to 220 images, split into 180 training, 30 validation, and 10 testing images. Both models are trained for 25 epochs on a T4 GPU with comparable configurations. At the detector level, YOLOv11 achieves a higher mean average precision (mAP) of 42.9%, compared with 28.9% for YOLOv8. At the behavior level, overall classification accuracy on the test set is 43.3% for YOLOv8 and 37.5% for YOLOv11. These results indicate a trade-off: YOLOv11 provides stronger bounding-box detection, while YOLOv8 yields slightly more stable behavior-level predictions on this very small and imbalanced dataset. The study emphasizes that these findings are exploratory baselines rather than definitive benchmarks, because the dataset is small and no statistical significance testing is performed. Future work should use a larger portion of the SCB dataset, more balanced class distributions, repeated experiments, and statistical analysis to obtain more robust conclusions.
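As a sketch of how the behavior-level overall accuracy reported above could be computed, the snippet below micro-averages correct predictions over the six behavior classes. The class names come from the abstract; the per-class tallies are hypothetical placeholders for illustration, not the study's actual results.

```python
# Sketch: behavior-level overall accuracy over the six SCB classes.
# Class names come from the paper; the per-class counts below are
# hypothetical placeholders, NOT the study's actual per-class results.

CLASSES = ["lookup", "raise-hand", "read", "stand", "turn-head", "write"]

def overall_accuracy(correct: dict, total: dict) -> float:
    """Micro-averaged accuracy: correctly classified boxes / all boxes."""
    return sum(correct[c] for c in CLASSES) / sum(total[c] for c in CLASSES)

# Hypothetical tallies for illustration only.
correct = {"lookup": 3, "raise-hand": 1, "read": 2,
           "stand": 2, "turn-head": 1, "write": 4}
total = {"lookup": 6, "raise-hand": 4, "read": 5,
         "stand": 4, "turn-head": 5, "write": 6}

print(f"{overall_accuracy(correct, total):.1%}")  # prints "43.3%"
```

Micro-averaging weights each detected box equally, so on an imbalanced dataset like this one the majority classes dominate the score; a per-class (macro) average would expose weak minority classes more clearly.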
Copyright © 2026