Auditable Automated Essay Scoring and Formative Feedback: A Rubric-Grounded Pipeline for Secondary and Higher Education
Qi Xin
Journal of Applied Artificial Intelligence in Education Vol 2, No 1 (2026): July 2026
Publisher : Academic Bright Collaboration

DOI: 10.66053/jaaie.v2i1.348

Abstract

Automated essay scoring in education is increasingly expected to do more than reproduce human holistic scores; classroom use also demands rubric-aligned feedback, transparent evidence, and a way to route uncertain cases to teachers. In this study, “LLM-ready” refers to a system that outputs structured score evidence, weak-trait signals, and document-level anchors that can later be verbalized by a language model without changing the underlying decision trace. This study aimed to evaluate whether a rubric-grounded, LLM-ready pipeline can achieve competitive scoring accuracy while also generating auditable formative feedback and a teacher-controllable review signal. The evaluation used the public ASAP corpus of 12,976 essays across eight prompts and prompt-wise five-fold cross-validation. Four holistic scorers were compared: length-only, rubric forest, prompt-adaptive centroid regressor (PACR), and the final RG-Score ensemble with trait grounding, isotonic calibration, and audit control. Auxiliary analytic scoring was examined on Prompts 2 and 7–8, and feedback experiments were conducted on all 2,292 essays from Prompts 7 and 8. PACR obtained the highest macro QWK of 0.739, while RG-Score reached 0.738 and provided a calibrated, auditable path to feedback. The prompt-level QWK for RG-Score ranged from 0.66 to 0.82, with particularly strong gains on Prompts 6 and 7. Auxiliary analytic scoring yielded QWK values of 0.623 for Prompt 2 domain2, 0.604 on average for Prompt 7 traits, and 0.506 on average for Prompt 8 traits. The rubric-grounded evidence feedback template achieved a Trait Recall@2 of 0.829, a valid evidence rate of 0.912, and an auditability index of 0.893 on Prompts 7 and 8. 
These findings support rubric-grounded AES as a practical assessment-support approach for secondary-school writing and as a structured foundation for higher-education formative feedback workflows, while also indicating that weaker trait models should be treated as advisory rather than fully autonomous.
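The abstract reports agreement as quadratic weighted kappa (QWK) per prompt and as a macro value across prompts. The sketch below shows one common way to compute both. It is not taken from the paper: the function names are hypothetical, and "macro QWK" is assumed here to mean the unweighted mean of per-prompt QWK values, which the abstract does not confirm.

```python
def quadratic_weighted_kappa(human, system, min_score, max_score):
    """QWK between two integer score lists on the scale [min_score, max_score]."""
    n = max_score - min_score + 1
    if n < 2:
        raise ValueError("the score scale needs at least two levels")
    # Observed confusion matrix: rows are human scores, columns are system scores.
    observed = [[0] * n for _ in range(n)]
    for h, s in zip(human, system):
        observed[h - min_score][s - min_score] += 1
    total = len(human)
    hist_h = [sum(row) for row in observed]
    hist_s = [sum(observed[i][j] for i in range(n)) for j in range(n)]
    num = 0.0
    den = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2   # quadratic disagreement penalty
            expected = hist_h[i] * hist_s[j] / total  # count expected by chance
            num += weight * observed[i][j]
            den += weight * expected
    if den == 0.0:
        # Both raters used a single identical score level; treated as full agreement.
        return 1.0
    return 1.0 - num / den


def macro_qwk(per_prompt_qwk):
    """Assumption: macro QWK is the plain mean of the per-prompt values."""
    return sum(per_prompt_qwk) / len(per_prompt_qwk)
```

For example, `quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4], 1, 4)` gives perfect agreement, while fully reversed scores on a three-point scale give the minimum value.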

Log Anomaly Detection with Conformal Alert Control and Evidence-Grounded Incident Ticket Generation
Qi Xin
Aviation Electronics, Information Technology, Telecommunications, Electricals, and Controls (AVITEC) Vol 8, No 2 (2026): August
Publisher : Institut Teknologi Dirgantara Adisutjipto

DOI: 10.28989/avitec.v8i2.3974

Abstract

Operational logs are a primary source of evidence for reliability engineering, incident response, and security operations, but log anomaly detection is useful only when scores can be translated into controlled alerts and auditable incident evidence. This paper presents a reproducible end-to-end AIOps pipeline that normalizes raw logs into templates, aggregates them into sliding windows, scores anomalies with representative detectors, calibrates alerts with conformal prediction, and generates evidence-grounded incident tickets. The revised evaluation includes BGL_2k and two additional public sequence benchmarks, HDFS and OpenStack, and adds representative LogAnomaly-style and LogBERT-lite baselines to the original TF-IDF+LR, Isolation Forest, DeepLog-style LSTM, and Transformer comparisons. On BGL_2k, Isolation Forest provides the best ranking performance among the original four detectors (test PR-AUC = 0.750), while the additional HDFS experiment shows that the masked-context LogBERT-lite baseline obtains the strongest sequence-level result (PR-AUC = 0.947, F1 = 0.905). OpenStack remains difficult because the available normal training sample is very small, producing low F1 across all added baselines. We also report inference latency, throughput, memory footprint, conformal alpha sensitivity, window-size sensitivity, model-strategy ablations, and structured false-positive/false-negative patterns. The results should be interpreted as reproducible operational validation of the detection-calibration-ticket workflow rather than a claim of state-of-the-art detector accuracy. The pipeline demonstrates how calibrated scores and template-level evidence can support practical alert control and ITSM-ready ticket generation.
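The abstract mentions calibrating alerts with conformal prediction but does not say which variant is used. As an illustration only, the sketch below assumes a split-conformal setup: anomaly scores from held-out normal windows form a calibration set, and a window is alerted when its score exceeds the calibrated quantile. The function names and the alert rule are hypothetical and not taken from the paper.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Split-conformal threshold from anomaly scores of held-out normal windows.

    Assuming exchangeable scores, a new normal window exceeds the returned
    threshold with probability at most alpha.
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: no finite threshold exists.
        return math.inf
    return sorted(calibration_scores)[rank - 1]


def should_alert(score, threshold):
    """Raise an alert only when the score is strictly above the threshold."""
    return score > threshold
```

With nine calibration scores `1..9` and `alpha = 0.25`, the threshold is the eighth smallest score. Lowering `alpha` to 0.05 with the same nine points yields an infinite threshold, which is one way a very small normal sample can suppress alerts entirely.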