Background: Test instruments play an important role in measuring ability, including in competitive events such as the Arabic Grammar Olympiad. Mapping participants' ability profiles accurately therefore requires standardized, valid test items that support learning outcomes. The Indonesian government also emphasizes the concept of Deep Learning, which focuses not only on cognitive aspects but also on achieving graduate profile outcomes. Purpose: This study aims to analyze Olympiad test items in terms of validity, reliability, difficulty level, discrimination power, distractor effectiveness, cognitive level, and alignment with the eight Deep Learning graduate profiles of the Merdeka Curriculum. Method: This study uses a mixed-methods approach that combines a quantitative ex post facto design with a qualitative content analysis design. Results and Discussion: The findings indicate that the Olympiad test items demonstrate high reliability; however, only 43% of the items are valid, 87% are categorized as easy in terms of difficulty level, 57% have poor discrimination power, and only 40% of the distractors are effective. The cognitive level of the items is still dominated by the Lower Order Thinking Skills (LOTS) category (67%). Furthermore, the test items do not fully represent the eight Deep Learning graduate profiles, as only three profiles are reflected: independence (100%), faith and devotion to God Almighty (33%), and critical reasoning (20%). Conclusions and Implications: This study concludes that the Arabic Grammar Olympiad test items still require improvement to function optimally as instruments for measuring participants' abilities while supporting the achievement of Deep Learning graduate profiles.