Reliable assessment of pre-service teachers' mathematical literacy skills is crucial to ensuring their readiness to teach. However, conventional reliability indices cannot disentangle multiple sources of measurement error simultaneously. This study applies Generalizability Theory (G-Theory) to evaluate and diagnose the reliability of a mathematical literacy instrument. A two-facet nested design was employed, involving 30 undergraduate students (p), six open-ended essay items (i), and three independent raters (r). A G-study was conducted to estimate variance components, revealing that raters were the largest source of variance (92.8%). The initial analysis yielded a generalizability coefficient of 0.27 and a dependability coefficient of 0.01, indicating low reliability due to substantial rater inconsistency. A D-study was conducted to evaluate alternative designs; however, the results showed that even with more items and raters, the G-coefficient reached only 0.38, below the threshold of 0.80. These findings show that rater effects pose a major threat to measurement stability and suggest that the current instrument requires fundamental refinement of its scoring rubrics and procedures. This study serves as a critical diagnostic for teacher education programs, providing an empirical basis for optimizing assessment design and institutionalizing rater calibration to improve the evaluation of future educators.
Copyrights © 2026