Generative Artificial Intelligence Label Reliability in Programming Assessment: Reliabilitas Label Kecerdasan Buatan Generatif pada Asesmen Algoritma Pemrograman
Khairunnisa, Raissa Araminta; Pujianto, Utomo
Indonesian Journal of Innovation Studies Vol. 27 No. 1 (2026): January
Publisher : Universitas Muhammadiyah Sidoarjo

DOI: 10.21070/ijins.v27i1.1881

Abstract

General Background: The integration of Generative AI in educational assessment enables rapid construction of large-scale question banks, particularly in programming education, yet raises concerns regarding content validity.

Specific Background: In algorithm and programming domains, Generative AI models frequently assign Higher Order Thinking Skills (HOTS) and Lower Order Thinking Skills (LOTS) labels automatically, creating potential discrepancies with Bloom's Taxonomy classifications.

Knowledge Gap: Empirical evidence validating the reliability of AI-generated cognitive labels and comparing statistical and transformer-based classification methods on small, domain-specific Indonesian datasets remains limited.

Aims: This study aims to audit the reliability of cognitive labels generated by the Gemini model through expert validation and to compare TF-IDF–SVM and IndoBERT–SVM classifiers under class-imbalanced conditions.

Results: Expert validation revealed substantial mislabeling, with a claimed balanced dataset becoming skewed toward LOTS. Classification experiments using five-fold cross-validation showed that TF-IDF–SVM achieved a slightly higher macro F1-score than IndoBERT–SVM.

Novelty: The study demonstrates that simple lexical representations with stemming can outperform transformer-based embeddings when data are limited and domain-specific.

Implications: These findings emphasize the necessity of human validation in AI-generated assessments and support the use of lightweight statistical text classification for automated cognitive level evaluation in constrained educational contexts.

Highlights
• Generative AI cognitive labels showed substantial inconsistency after expert validation
• Lexical feature representation yielded higher macro-level classification balance
• Human-in-the-loop validation remained essential for programming assessment datasets

Keywords: HOTS; LOTS; Generative AI; Text Classification; TF-IDF
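The TF-IDF–SVM setup the abstract describes can be sketched as follows: a minimal, illustrative pipeline (not the authors' code) that feeds TF-IDF features into a linear SVM and scores it with five-fold cross-validation on macro F1. The tiny set of Indonesian assessment prompts and their HOTS/LOTS labels below is an invented placeholder, not the study's dataset.

```python
# Sketch of the abstract's TF-IDF–SVM baseline with five-fold CV.
# Dataset below is a hypothetical stand-in for the study's question bank.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder prompts: recall/apply items labelled LOTS,
# analyse/evaluate/create items labelled HOTS.
texts = [
    "sebutkan keluaran program berikut",
    "tuliskan sintaks deklarasi variabel pada python",
    "jelaskan fungsi perulangan for",
    "apa hasil dari kode di atas",
    "definisikan istilah algoritma",
    "sebutkan tipe data dasar pada python",
    "tuliskan perintah untuk mencetak teks",
    "jelaskan pengertian fungsi rekursif",
    "apa keluaran pernyataan print berikut",
    "sebutkan operator aritmatika pada python",
    "analisis penyebab kesalahan pada kode berikut",
    "rancang algoritma untuk mengurutkan data",
    "bandingkan efisiensi dua algoritma pencarian",
    "evaluasi kompleksitas waktu fungsi rekursif ini",
    "usulkan perbaikan agar program lebih efisien",
    "rancang struktur data untuk kasus antrean",
    "analisis hasil eksekusi pada kasus batas",
    "bandingkan pendekatan iteratif dan rekursif",
    "evaluasi kebenaran logika program berikut",
    "rancang solusi untuk masalah penjadwalan",
]
labels = ["LOTS"] * 10 + ["HOTS"] * 10

# TF-IDF features -> linear SVM, evaluated with stratified 5-fold CV
# on the macro-averaged F1 score (robust under class imbalance).
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
scores = cross_val_score(clf, texts, labels, cv=5, scoring="f1_macro")
print("macro F1 per fold:", scores.round(2))
print("mean macro F1:", round(scores.mean(), 2))
```

Macro F1 averages per-class F1 scores equally, which is why the study reports it: with the validated dataset skewed toward LOTS, accuracy alone would overstate performance. The IndoBERT–SVM variant would differ only in the feature step (transformer embeddings instead of TfidfVectorizer).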