This study investigates the application of \emph{risk-sensitive reinforcement learning} to heavy-tailed return series by comparing two primary algorithms: REINFORCE with baseline (REINFORCE-BL) and episodic batched actor--critic (A2C-B). Initial exploratory analysis reveals an asymmetric return distribution with numerous extreme \emph{outliers}, rendering variance-based risk measures inadequate and motivating the integration of tail-based risk measures, specifically Value at Risk (VaR), Conditional Value at Risk (CVaR), and Entropic Value at Risk (EVaR), into the RL objective. The study constructs a simple portfolio environment with discrete actions (market entry, market exit, and \emph{hold}) and trains both algorithms under four scenarios: risk-neutral, VaR, CVaR, and EVaR. Experimental results show that A2C-B consistently outperforms REINFORCE-BL across all scenarios, achieving higher average long-term rewards, faster convergence, and more stable \emph{learning curves}. Whereas VaR and CVaR penalties significantly reduce rewards and increase learning volatility for REINFORCE-BL, A2C-B experiences only moderate reward reductions while maintaining stability. In the EVaR scenario, both algorithms yield high rewards, though A2C-B retains a slight advantage in stability. These findings indicate that, in environments with heavy-tailed returns, employing coherent risk measures (particularly CVaR and EVaR) within an actor--critic framework offers a more compelling trade-off between tail-risk control and average performance, providing a viable \emph{baseline} for the development of risk-sensitive RL in finance and actuarial science.
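For concreteness, a minimal sketch of one way such a tail penalty can enter the objective (assuming a hypothetical penalty weight $\lambda \ge 0$ and confidence level $\alpha$; the study's exact formulation may differ) is to subtract a weighted risk term from the expected episodic return $G_\theta$, where CVaR and EVaR admit their standard variational representations:
\begin{align}
J_\lambda(\theta) &= \mathbb{E}\!\left[G_\theta\right] - \lambda\,\rho_\alpha\!\left(-G_\theta\right), \qquad \rho_\alpha \in \{\mathrm{VaR}_\alpha,\ \mathrm{CVaR}_\alpha,\ \mathrm{EVaR}_\alpha\},\\
\mathrm{CVaR}_\alpha(L) &= \inf_{z \in \mathbb{R}} \left\{ z + \tfrac{1}{1-\alpha}\,\mathbb{E}\!\left[(L - z)_+\right] \right\},\\
\mathrm{EVaR}_\alpha(L) &= \inf_{t > 0} \tfrac{1}{t} \log\!\left( \tfrac{\mathbb{E}\left[e^{tL}\right]}{1-\alpha} \right),
\end{align}
where $L$ denotes the episodic loss and $(x)_+ = \max(x, 0)$.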