Standardizing Arabic language proficiency testing through the Test of Arabic as a Foreign Language (TOAFL) has become an urgent need as demand for internationally comparable language credentials grows. However, aligning TOAFL instruments with the Common European Framework of Reference for Languages (CEFR) still poses complex psychometric challenges. This study presents a systematic literature review of the psychometric evaluation of TOAFL instruments within the CEFR framework, focusing on validity, reliability, and the effectiveness of Confidence Level (CL) in reducing the effect of guessing. The review follows the PRISMA protocol. Journal articles published between 2018 and 2026 were retrieved from the Google Scholar, DOAJ, and SINTA databases and examined using content analysis and synthesis techniques. The findings indicate that most TOAFL instruments show high reliability in the Tarakib (Structure) and Qira’ah (Reading) sections, which correspond to CEFR levels A2–B1. However, a significant gap remains in measuring levels B2–C2 because the instruments lack productive-communicative items. The reviewed studies also indicate that integrating Confidence Level improves diagnostic accuracy regarding test-takers' actual abilities compared with conventional multiple-choice tests. Reconstructing TOAFL instruments on the basis of psychometric data aligned with CEFR descriptors is therefore essential to ensure that results are valid across international contexts and that graduates of Arabic language programs remain internationally competitive.