This study evaluates the quality of Arabic test items in madrasah assessments using a quantitative approach based on Multidimensional Item Response Theory (MIRT). The sample comprised 321 twelfth-grade students from MAN 1 Surakarta, purposively selected because the institution implements systematic and independent assessments. Data were obtained from student responses to the final Arabic examination in the 2022/2023 academic year. Exploratory Factor Analysis (EFA) was first conducted to identify the dimensional structure of the test, with sampling adequacy and factorability judged by KMO > 0.60 and a significant Bartlett’s Test of Sphericity (p < 0.05). The number of factors retained was determined by eigenvalues > 1 and supported by scree plot inspection. A MIRT two-parameter logistic (2PL) model was then fitted in R, and model fit was evaluated against RMSEA < 0.06, CFI > 0.90, and TLI > 0.90. Item parameters included discrimination (d) and difficulty (b), with discrimination classified as unacceptable (< 0.00), very low (0.00–0.34), low (0.35–0.64), moderate (0.65–1.34), or high (≥ 1.35). The findings show substantial variability in item performance. Most items demonstrated acceptable discrimination; however, 16 items had negative discrimination, indicating weaknesses in content representation and item construction. Five items (items 1, 3, 7, 10, and 22) showed high discrimination and were therefore highly informative. Easy items dominated the difficulty distribution, limiting the test’s ability to distinguish among medium- to high-ability examinees. The study recommends revising misfitting items, adding items with moderate difficulty and d > 0.65, and strengthening validity evidence through Confirmatory Factor Analysis and bias detection using Differential Item Functioning (DIF) analysis.
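The discrimination bands stated above can be expressed as a small lookup, shown here as a minimal Python sketch rather than the authors' R code. The function name `classify_discrimination`, the treatment of the bands as contiguous with cut points at 0.35, 0.65, and 1.35, and the example values are all assumptions made for illustration; the example values are not results from the study.

```python
def classify_discrimination(d: float) -> str:
    """Map a discrimination estimate d to the bands given in the abstract.

    Assumption: the bands are contiguous, so a value such as 0.345
    (between the printed limits 0.34 and 0.35) falls in "very low".
    """
    if d < 0.0:
        return "unacceptable"
    if d < 0.35:
        return "very low"
    if d < 0.65:
        return "low"
    if d < 1.35:
        return "moderate"
    return "high"


# Hypothetical discrimination estimates, for illustration only.
example_estimates = {
    "item_A": -0.20,
    "item_B": 0.10,
    "item_C": 0.50,
    "item_D": 0.90,
    "item_E": 1.60,
}

# Label each hypothetical item with its discrimination band.
labels = {item: classify_discrimination(d) for item, d in example_estimates.items()}

for item, label in labels.items():
    print(f"{item}: d = {example_estimates[item]:+.2f} -> {label}")
```

Under this reading, items labelled "unacceptable" correspond to the negatively discriminating items flagged for revision, and items labelled "moderate" or "high" meet the lower bound the study recommends for new items.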
Copyrights © 2025