Abstract [eng]
This work explores and evaluates the performance of different reinforcement learning methods for training an agent in a simulated environment. The primary focus is on the Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC) algorithms, comparing variants such as PPO with Adaptive KL Penalty, PPO with Trust Region Optimization, SAC with Fixed Temperature, and SAC with Automatic Temperature Adjustment. The problem addressed in this study is optimizing policy learning in dynamic environments: ensuring stable convergence, maximizing cumulative reward, and selecting the appropriate method for a given scenario. The experiments were conducted in two different environments, in which each method was tested and analyzed using metrics such as reward per episode, policy loss, value loss, and KL divergence. Results demonstrated that SAC with Automatic Temperature Adjustment performed best in the first environment owing to its ability to adaptively balance exploration and exploitation. In contrast, PPO with Trust Region Optimization excelled in the second environment, demonstrating improved stability and reward accumulation. Key hyperparameters, including the discount factor (gamma), learning rate, entropy coefficients, KL penalties, and batch sizes, were systematically tuned to achieve optimal performance. The findings of this study suggest that SAC-based methods offer better adaptability, especially in environments requiring dynamic exploration, while PPO-based approaches ensure stability in controlled settings with constrained policy updates. These insights contribute to the ongoing improvement of deep reinforcement learning applications for real-world decision-making tasks.
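The automatic temperature adjustment highlighted in the results refers to SAC's standard mechanism of tuning the entropy coefficient alpha toward a target entropy. The following is a minimal, generic sketch of that update, not the thesis implementation; it assumes PyTorch, and the names action_dim, target_entropy, and update_temperature are hypothetical and chosen only for illustration.

```python
# Illustrative sketch of SAC's automatic temperature adjustment:
# minimize J(alpha) = E[-alpha * (log_pi(a|s) + target_entropy)]
# so that the policy entropy is driven toward the chosen target.
import torch

action_dim = 2                                # hypothetical action dimensionality
target_entropy = -float(action_dim)           # common heuristic: -|A|
log_alpha = torch.zeros(1, requires_grad=True)
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

def update_temperature(log_probs: torch.Tensor) -> float:
    """One gradient step on the temperature objective.

    log_probs: log pi(a|s) for a batch of sampled actions; detached because
    the temperature update does not backpropagate through the policy.
    """
    alpha_loss = -(log_alpha * (log_probs.detach() + target_entropy)).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    # The exponentiated value is the alpha used to weight the entropy bonus
    # in the actor and critic losses.
    return log_alpha.exp().item()

# Example call with a fake batch of log-probabilities.
alpha = update_temperature(torch.randn(64))
print(f"updated alpha: {alpha:.4f}")
```

Because alpha is learned rather than fixed, the entropy bonus shrinks as the policy converges, which is the adaptability the abstract credits for SAC's advantage in the first environment.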