Scientific Journal of Informatics
Vol. 12 No. 3: August 2025

Comparative Analysis of High School Student and AI-Generated Essays Using IndoBERT and Linguistic Features

Adani, Muhammad Harits Shofwan
Rausanfita, Alqis
Mustaqim, Tanzilal



Article Info

Publish Date
04 Oct 2025

Abstract

Purpose: This study addresses the growing challenge of distinguishing essays written by humans from essays generated by AI, particularly in the context of high school education in Indonesia. It analyzes the semantic and linguistic differences between student-written and ChatGPT-generated essays in the Indonesian language. Methods: The study employs an IndoBERT-based semantic model trained with triplet loss to generate paragraph-level embeddings, allowing semantic similarity to be measured within and between essay classes. Additionally, linguistic features such as lexical diversity, word count, modal usage, and stopword ratio were extracted to capture stylistic and structural differences. These features are combined and used as input to a neural network classifier. Result: The IndoBERT-based semantic model successfully grouped student-written and ChatGPT-generated essays into distinct clusters. Similarity scores within student essays ranged from 0.7 to 0.9, while similarity between classes was mostly negative with a few outliers; negative values are possible because the cosine similarity metric used in this study ranges from -1 to 1. The classification model achieved 90.55% accuracy and an AUC of 0.9999 when evaluated on the independent test set defined in the Data Preparation stage. These results suggest that student-written and ChatGPT-generated essays form distinct semantic clusters. Student essays show more linguistic diversity, while ChatGPT essays are more consistent in coherence and formality. Novelty: This study provides empirical insights into the semantic similarities and linguistic features that differentiate human-written from AI-generated essays in the Indonesian language. It supports academic integrity efforts and highlights the need for further research across different writing models and contexts.
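The quantities named in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the stopword and modal lists, the regex tokenizer, the use of type-token ratio as "lexical diversity", and the cosine-distance form of the triplet loss with a margin of 0.5 are all assumptions made for illustration, since the abstract does not specify them. Embeddings are represented as plain lists of floats rather than IndoBERT outputs.

```python
import math
import re

# Illustrative word lists only (assumption); the study's actual lists are not given.
STOPWORDS = {"yang", "dan", "di", "ke", "dari", "ini", "itu", "untuk", "dengan", "pada"}
MODALS = {"harus", "dapat", "bisa", "akan", "perlu", "boleh", "mungkin"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def linguistic_features(text):
    """Return word count, type-token ratio, modal count, and stopword ratio."""
    tokens = tokenize(text)
    n = len(tokens)
    if n == 0:
        return {"word_count": 0, "lexical_diversity": 0.0,
                "modal_count": 0, "stopword_ratio": 0.0}
    return {
        "word_count": n,
        # Type-token ratio: unique tokens / total tokens (assumed measure).
        "lexical_diversity": len(set(tokens)) / n,
        "modal_count": sum(t in MODALS for t in tokens),
        "stopword_ratio": sum(t in STOPWORDS for t in tokens) / n,
    }


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; lies in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors in this sketch
    return dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Triplet loss on cosine distance: max(0, d(a, p) - d(a, n) + margin)."""
    d_pos = 1.0 - cosine_similarity(anchor, positive)
    d_neg = 1.0 - cosine_similarity(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


if __name__ == "__main__":
    print(linguistic_features("Siswa harus belajar dengan giat dan siswa dapat berhasil"))
    print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))
```

The loss is zero once the anchor is closer to the same-class paragraph than to the other-class paragraph by at least the margin. Training toward that condition pushes between-class cosine similarity down, possibly below zero, which is consistent with the mostly negative between-class scores reported.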

Copyrights © 2025






Journal Info

Abbrev

sji

Subject

Computer Science & IT; Control & Systems Engineering; Decision Sciences, Operations Research & Management; Electrical & Electronics Engineering; Engineering

Description

Scientific Journal of Informatics (p-ISSN 2407-7658 | e-ISSN 2460-0040) is published by the Department of Computer Science, Universitas Negeri Semarang. It is a scientific journal of Information Systems and Information Technology that includes scholarly writings on pure research and applied research in the ...