Assessment plays an important role in learning as a measure of success and of students' competency achievement. In this context, Computerized Adaptive Testing (CAT), an Item Response Theory (IRT)-based adaptive assessment solution, has been widely used, but it has technical limitations such as heuristic question selection, dependence on question banks, and unidimensionality. Decision-making problems in adaptive testing can also be addressed with policy-based reinforcement learning, in particular policy-gradient methods such as the REINFORCE algorithm. However, this algorithm suffers from high gradient variance and lacks a state-value evaluation mechanism, so it cannot provide direct feedback on the quality of the actions taken. The purpose of this study is to optimize adaptive decision making in a Partially Observable Markov Decision Process (POMDP) framework using the Advantage Actor-Critic (A2C) algorithm, one of the reinforcement learning approaches. The actor generates a question selection policy based on the belief state from the NCDM, while the critic evaluates the quality of actions to maximize cumulative reward. The results show that in an adaptive environment A2C outperforms the baseline, with an accuracy of 0.952 and an average reward of 18.56 in 20-question episodes, and an accuracy of 0.934 and a reward of 22.58 in 25-question episodes. In contrast, the baseline achieved average accuracies of only about 0.789 and 0.760, with rewards of 14.19 and 16.80, in the 20- and 25-question episodes, respectively. These results indicate that optimization with A2C can improve the personalization of exam question selection. This study contributes to the development of a more effective adaptive exam model and opens up opportunities for further research.
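
The following is a minimal sketch of how the actor-critic roles described above could be wired up, assuming PyTorch, a fixed-length belief vector produced by the cognitive diagnosis model, and a discrete question bank. The names (ActorCritic, select_question, a2c_update, BELIEF_DIM, N_ITEMS) and the one-step update are illustrative assumptions, not the study's actual implementation; the belief-update and reward logic of the environment are omitted.

```python
# Hypothetical sketch: actor selects the next question from the belief state,
# critic estimates V(belief) and supplies the baseline that REINFORCE lacks.
import torch
import torch.nn as nn
import torch.nn.functional as F

BELIEF_DIM = 32   # assumed length of the belief-state vector (per-skill mastery estimates)
N_ITEMS = 500     # assumed size of the question bank

class ActorCritic(nn.Module):
    """Actor: policy over candidate questions. Critic: state-value estimate."""
    def __init__(self, belief_dim: int, n_items: int, hidden: int = 128):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(belief_dim, hidden), nn.ReLU())
        self.actor = nn.Linear(hidden, n_items)   # logits over questions
        self.critic = nn.Linear(hidden, 1)        # V(belief)

    def forward(self, belief: torch.Tensor):
        h = self.shared(belief)
        return self.actor(h), self.critic(h).squeeze(-1)

def select_question(model: ActorCritic, belief: torch.Tensor):
    """Sample the next question index from the actor's policy."""
    logits, value = model(belief)
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()
    return action, dist.log_prob(action), value

def a2c_update(optimizer, log_prob, value, reward, next_value, gamma=0.99):
    """One-step advantage actor-critic update:
    advantage = r + gamma * V(s') - V(s); the critic's value acts as a
    baseline, reducing the gradient variance of plain REINFORCE."""
    target = reward + gamma * next_value.detach()
    advantage = target - value
    actor_loss = -log_prob * advantage.detach()
    critic_loss = F.mse_loss(value, target)
    loss = actor_loss + 0.5 * critic_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Usage sketch for a single step of an episode:
# model = ActorCritic(BELIEF_DIM, N_ITEMS)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# belief = torch.zeros(BELIEF_DIM)                  # belief state from the diagnosis model
# action, log_prob, value = select_question(model, belief)
# reward, next_belief = 1.0, torch.zeros(BELIEF_DIM)  # returned by the testing environment
# _, _, next_value = select_question(model, next_belief)
# a2c_update(optimizer, log_prob, value, reward, next_value)
```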