Background: This study investigates the Proximal Policy Optimization (PPO) algorithm in two text-based case studies: aligning large language models (LLMs) with human preferences and dynamic pricing based on customer reviews.

Objective: This research aims to evaluate PPO's effectiveness in text-based decision-making through these two cases.

Methods: Reinforcement-learning experiments are conducted using PPO. In the LLM case, PPO is integrated with preference learning to enhance alignment, BLEU, and human-likeness scores; in the economic scenario, PPO learns adaptive pricing strategies from customer reviews. Cross-validation and ablation studies assess PPO's generalization capability and the contributions of the reward components, clipping, and exploration strategies.

Result: PPO combined with preference-based learning significantly improves alignment, BLEU, and human-likeness scores in the LLM case. In the pricing scenario, PPO achieves high accuracy, i.e., low Mean Absolute Error (MAE), and the highest cumulative reward, outperforming the A3C and DDPG algorithms. Overall, PPO performs well across both distinct domains and offers a stable and efficient solution for text-based tasks.

Conclusion: The findings confirm PPO's flexibility for various NLP applications and intelligent decision-making systems.
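Since the ablation studies examine the contribution of clipping, it may help to recall the standard PPO clipped surrogate objective (Schulman et al., 2017); this is the general formulation, not necessarily the exact variant implemented in this study:

$$
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\Big(r_t(\theta)\,\hat{A}_t,\;\ \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\Big)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
$$

Here $\hat{A}_t$ is the advantage estimate and $\epsilon$ is the clipping range; by bounding the probability ratio $r_t(\theta)$, the clip term limits the size of each policy update, which is the mechanism behind the stability the abstract reports.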