Reinforcement Learning (RL) has achieved remarkable success in complex sequential decision-making tasks. However, modern RL models often lack explainability, creating a serious "black box" problem, especially in high-stakes domains. This study proposes a Pygame-based real-time visualization architecture for RL and demonstrates its benefits in a Cliff Walking case study using the Q-Learning and SARSA algorithms. Key contributions include: (1) a real-time visualization architecture that decouples training logic from graphics rendering and sustains more than 60 FPS, (2) interpretive visualization techniques including diverging heatmaps, dynamic policy arrows, and Ghost Policies, and (3) a comprehensive empirical study clarifying the distinct characteristics of the two algorithms. Experimental results show that Q-Learning selects an efficient but risky path, consistent with its optimistic off-policy nature, while SARSA converges on a safer path, reflecting its on-policy nature, which accounts for the risk of exploratory actions. Quantitatively, Q-Learning achieved the optimal 13-step path while accumulating 10,642 cliff falls, whereas SARSA converged to a safe 23-step path with a significantly higher collision count (232,844) incurred while avoiding the cliff zone's extreme penalties.
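The off-policy/on-policy contrast described above comes down to the temporal-difference target each algorithm bootstraps from. The following is a minimal sketch, not the paper's implementation; the grid dimensions, learning rate, and discount factor are illustrative assumptions chosen to match the standard 4x12 Cliff Walking layout.

```python
import numpy as np

n_states, n_actions = 48, 4   # assumed 4x12 Cliff Walking grid
alpha, gamma = 0.5, 1.0       # illustrative learning rate and discount

def q_learning_update(Q, s, a, r, s_next):
    # Off-policy: the target uses the greedy (max) next-state value,
    # ignoring the risk that an epsilon-greedy step may fall off the cliff.
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next):
    # On-policy: the target uses the action the behavior policy actually
    # takes next, so exploratory missteps near the cliff depress the
    # values of adjacent cells and push the learned path away from it.
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

Q = np.zeros((n_states, n_actions))
q_learning_update(Q, s=0, a=1, r=-1, s_next=1)
sarsa_update(Q, s=1, a=2, r=-1, s_next=2, a_next=3)
```

With a zero-initialized table both targets coincide; the behavioral split emerges once exploration near the cliff makes `Q[s_next, a_next]` diverge from `Q[s_next].max()`.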
Copyright © 2026