Automated Short Answer Grading (ASAG) is crucial for scalable feedback, but applying it to low-resource languages such as Indonesian remains challenging: modern Large Language Models (LLMs) tend to severely overfit small, specialized educational datasets, limiting their utility. This study compares nine traditional machine learning models against two fine-tuning strategies for Gemma-3-1b-it on an expanded Indonesian ASAG dataset (n = 220): (a) standard fine-tuning, which predicts only the score, and (b) a proposed reasoning-guided approach, in which the model first generates a score rationale obtained via knowledge distillation before predicting the score. The reasoning-guided model (Gemma-3-1b-ASAG-ID-Reasoning) achieved state-of-the-art performance (QWK 0.7791; Spearman's ρ 0.8276), significantly surpassing the best traditional model in this study (SVR, QWK 0.6952). This work thereby advances foundational LSA-based approaches to the task by introducing a more robust methodology and evaluation framework. Crucially, standard fine-tuning (Gemma-3-1b-ASAG-ID) suffered catastrophic overfitting (QWK 0.7279), indicated by near-perfect training performance but poor generalization to the test set. Although the reasoning-guided LLM was more accurate, it required over 35 times more inference time. These results demonstrate that distilled reasoning acts as a powerful regularizer, compelling the LLM to learn the underlying grading logic rather than memorizing answer-score pairs, and establish a viable method for high-performance ASAG in data-scarce environments despite the computational trade-off.
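As a minimal, illustrative sketch (not taken from the study's code), the two reported evaluation metrics, Quadratic Weighted Kappa and Spearman's rank correlation, can be computed with scikit-learn and SciPy as shown below; the function name evaluate_asag and the 0-4 score scale are hypothetical, and only the metric choices reflect the abstract.

# Illustrative sketch: compute QWK and Spearman's rho for gold vs. predicted scores.
from sklearn.metrics import cohen_kappa_score
from scipy.stats import spearmanr

def evaluate_asag(gold_scores, predicted_scores):
    """Return (QWK, Spearman's rho) for integer answer scores on a shared scale."""
    qwk = cohen_kappa_score(gold_scores, predicted_scores, weights="quadratic")
    rho, _p_value = spearmanr(gold_scores, predicted_scores)
    return qwk, rho

# Hypothetical usage with scores on a 0-4 scale.
gold = [4, 3, 0, 2, 4, 1, 3, 2]
pred = [4, 2, 1, 2, 3, 1, 3, 2]
print(evaluate_asag(gold, pred))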