Essay assessment in online learning demands significant time, effort, and consistency, all of which are difficult to sustain when grading is done manually. This study explores the use of the large language model GPT-3.5 Turbo as the core of an automated essay scoring system for online learning platforms. The research employs a Research and Development (R&D) approach based on the ADDIE development model (Analysis, Design, Development, Implementation, and Evaluation) and follows the Cross-Industry Standard Process for Data Mining (CRISP-DM) framework for the modeling methodology. The automated essay scoring system using Prompt 4 demonstrated exceptionally high accuracy and reliability: the model achieved an accuracy of 94.3%, an F1-Score of 0.955, and a Cohen's Kappa of 0.878. This Kappa value indicates very strong agreement between the AI-generated assessments and the educator-validated gold standard, far exceeding the initial inter-rater agreement among the educators themselves, which was only 0.1157. Prompt 4's superior performance is further confirmed by the lowest Mean Absolute Error (MAE), 30.54, and the highest Area Under the Curve (AUC), 0.956, among the prompts tested.
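As a rough illustration of the pipeline the abstract describes, the sketch below shows how a GPT-3.5 Turbo scoring call and the reported evaluation metrics (accuracy, F1-Score, Cohen's Kappa, MAE, AUC) could be wired together in Python with the OpenAI client and scikit-learn. The rubric prompt, pass threshold, and sample scores are hypothetical placeholders, not the study's actual Prompt 4 or data.

```python
# Illustrative sketch only: the grading prompt, threshold, and scores below are
# hypothetical; the study's actual Prompt 4 and dataset are not reproduced here.
from openai import OpenAI
from sklearn.metrics import (accuracy_score, f1_score, cohen_kappa_score,
                             mean_absolute_error, roc_auc_score)

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def score_essay(question: str, answer: str) -> int:
    """Ask GPT-3.5 Turbo for a 0-100 score; the rubric prompt is a placeholder."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # deterministic output helps scoring consistency
        messages=[
            {"role": "system",
             "content": "You are an essay grader. Reply with only an integer score from 0 to 100."},
            {"role": "user",
             "content": f"Question: {question}\nStudent answer: {answer}\nScore:"},
        ],
    )
    return int(response.choices[0].message.content.strip())


# Hypothetical evaluation against an educator-validated gold standard.
# Continuous scores are binarized (e.g. pass at >= 60) for the classification metrics.
gold = [85, 40, 72, 55, 90]   # educator gold-standard scores (placeholder)
ai   = [80, 45, 70, 62, 88]   # AI-generated scores (placeholder)
threshold = 60
gold_bin = [s >= threshold for s in gold]
ai_bin   = [s >= threshold for s in ai]

print("Accuracy:", accuracy_score(gold_bin, ai_bin))
print("F1-Score:", f1_score(gold_bin, ai_bin))
print("Cohen's Kappa:", cohen_kappa_score(gold_bin, ai_bin))
print("MAE:", mean_absolute_error(gold, ai))
print("AUC:", roc_auc_score(gold_bin, [s / 100 for s in ai]))
```

In this sketch, MAE is computed on the raw 0-100 scores while the agreement and classification metrics use the binarized labels; whether the study computed its metrics this way is an assumption.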