Indonesian Journal of Innovation Studies
Vol. 27 No. 1 (2026): January

Generative Artificial Intelligence Label Reliability in Programming Assessment: Reliabilitas Label Kecerdasan Buatan Generatif pada Asesmen Algoritma Pemrograman

Khairunnisa, Raissa Araminta
Pujianto, Utomo



Article Info

Publish Date
04 Jan 2026

Abstract

General Background: The integration of Generative AI in educational assessment enables rapid construction of large-scale question banks, particularly in programming education, yet raises concerns regarding content validity. Specific Background: In algorithm and programming domains, Generative AI models frequently assign Higher Order Thinking Skills and Lower Order Thinking Skills labels automatically, creating potential discrepancies with Bloom’s Taxonomy classifications. Knowledge Gap: Empirical evidence validating the reliability of AI-generated cognitive labels and comparing statistical and transformer-based classification methods on small, domain-specific Indonesian datasets remains limited. Aims: This study aims to audit the reliability of cognitive labels generated by the Gemini model through expert validation and to compare TF-IDF–SVM and IndoBERT–SVM classifiers under class-imbalanced conditions. Results: Expert validation revealed substantial mislabeling, with a claimed balanced dataset becoming skewed toward LOTS. Classification experiments using five-fold cross-validation showed that TF-IDF–SVM achieved a slightly higher macro F1-score than IndoBERT–SVM. Novelty: The study demonstrates that simple lexical representations with stemming can outperform transformer-based embeddings when data are limited and domain-specific. Implications: These findings emphasize the necessity of human validation in AI-generated assessments and support the use of lightweight statistical text classification for automated cognitive level evaluation in constrained educational contexts.

Highlights

• Generative AI cognitive labels showed substantial inconsistency after expert validation
• Lexical feature representation yielded higher macro-level classification balance
• Human-in-the-loop validation remained essential for programming assessment datasets

Keywords

HOTS; LOTS; Generative AI; Text Classification; TF-IDF
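The abstract compares classifiers by macro F1-score under class imbalance. As a rough illustration of why that metric is a sensible choice when labels skew toward LOTS, the sketch below computes accuracy and macro F1 for a classifier that always predicts the majority class. The labels are invented toy data, not results from the study, and the function names `macro_f1` and `accuracy` are illustrative only.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (each class counts equally)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def accuracy(y_true, y_pred):
    """Fraction of items whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy labels (assumption, not study data): 8 LOTS items, 2 HOTS items.
y_true = ["LOTS"] * 8 + ["HOTS"] * 2
# A degenerate classifier that always predicts the majority class.
y_pred = ["LOTS"] * 10

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

On this toy data the majority-class predictor reaches 0.8 accuracy but a macro F1 of only about 0.444, because the HOTS class contributes an F1 of zero. That gap is the reason a macro-averaged score is informative for a dataset skewed toward one cognitive level.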

Copyrights © 2026






Journal Info

Abbrev

ijins

Publisher

Universitas Muhammadiyah Sidoarjo

Subject

Computer Science & IT; Education; Engineering; Law, Crime, Criminology & Criminal Justice

Description

Indonesian Journal of Innovation Studies (IJINS) is a peer-reviewed journal published by Universitas Muhammadiyah Sidoarjo four times a year. This journal provides immediate open access to its content on the principle that making research freely available to the public supports a greater global ...