Abstract [eng]
Reinforcement learning is one of the main paradigms of machine learning: a technique for sequential decision making in a fully or partially observed environment. The main objective of this approach is to optimise the rewards accumulated over time. Typically, the reward is a single scalar reflecting one objective, but many real-world tasks involve multiple conflicting objectives that cannot be adequately described by a single scalar reward. This has led to growing interest in the field of multi-objective reinforcement learning (MORL). A recently introduced method that delivers state-of-the-art results on various multi-objective reinforcement learning tasks is the Pareto Conditioned Network (PCN). By using a single neural network to learn all non-dominated policies, PCN has several advantages over other MORL algorithms, including fewer hyperparameters and better scalability to problems with many objectives. However, the original PCN method is only applicable to problems with discrete action spaces, which is a major limitation, since many real-world scenarios inherently require continuous actions rather than a choice from a limited set of discrete options.

To overcome this limitation, this Master's thesis presents a modified PCN that can operate efficiently in continuous action spaces. The main modifications concern the reconfiguration of the PCN output layer and the introduction of a different exploration strategy. Instead of producing multiple discrete outputs, the network was adapted to produce a single continuous prediction representing the policy action. This change turned training from a classification task into a regression task and required a corresponding change to the loss function used during training. In addition, the exploration strategy, an essential component of reinforcement learning, was significantly modified. In discrete action spaces, PCN explores by sampling actions from a categorical distribution in which the probability of each action is proportional to its confidence score, computed with the softmax function in the last layer of the network. This approach is not compatible with continuous action spaces, where the output value directly represents the action itself. Therefore, a new strategy was developed to encourage exploration while preserving the network's ability to learn and adapt to the continuous nature of the actions.

To evaluate the performance of the modified PCN, several multi-objective environments were used. These environments were selected to cover different levels of complexity and different characteristics, allowing a reliable assessment of the method's performance across scenarios. The modified PCN was compared against two other MORL algorithms. In summary, this Master's thesis not only presents the theoretical and practical adaptations needed to extend PCN to continuous action spaces, but also provides a comprehensive evaluation of their effectiveness. The insights gained from this research represent a significant contribution to the field of multi-objective reinforcement learning, opening up new possibilities for applying these advanced techniques to a wider range of real-world problems.
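To make the described output-layer and loss-function change concrete, the sketch below contrasts a discrete PCN-style head (one logit per action, trained with cross-entropy) with a continuous head (one bounded prediction per action dimension, trained with mean squared error). This is only an illustrative sketch in PyTorch under simplified assumptions; the names hidden_dim, n_actions and action_dim are placeholders and do not reflect the exact architecture used in the thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiscretePCNHead(nn.Module):
    """Original-style PCN output: one logit per discrete action (classification)."""
    def __init__(self, hidden_dim: int, n_actions: int):
        super().__init__()
        self.out = nn.Linear(hidden_dim, n_actions)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.out(h)  # raw logits; softmax is applied when sampling

    def loss(self, h: torch.Tensor, taken_actions: torch.Tensor) -> torch.Tensor:
        # Cross-entropy against the (integer) actions taken in replayed trajectories.
        return F.cross_entropy(self.forward(h), taken_actions)

class ContinuousPCNHead(nn.Module):
    """Modified output: one continuous value per action dimension (regression)."""
    def __init__(self, hidden_dim: int, action_dim: int):
        super().__init__()
        self.out = nn.Linear(hidden_dim, action_dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.out(h))  # bounded continuous action prediction

    def loss(self, h: torch.Tensor, taken_actions: torch.Tensor) -> torch.Tensor:
        # Mean squared error against the continuous actions from replayed trajectories.
        return F.mse_loss(self.forward(h), taken_actions)
```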
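The abstract does not spell out the final continuous exploration scheme. Purely as an illustration of the design space, the sketch below contrasts the original softmax-based categorical sampling with a generic continuous-action alternative: perturbing the network's deterministic prediction with Gaussian noise and clipping to the valid action range. The noise scale sigma and the action bounds are hypothetical parameters, and this is not necessarily the strategy adopted in the thesis.

```python
import torch

def sample_discrete_action(logits: torch.Tensor) -> int:
    # Original PCN exploration: sample from a categorical distribution whose
    # probabilities are the softmax of the network's confidence scores.
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

def sample_continuous_action(prediction: torch.Tensor,
                             sigma: float = 0.1,
                             low: float = -1.0,
                             high: float = 1.0) -> torch.Tensor:
    # Illustrative continuous exploration: add zero-mean Gaussian noise to the
    # predicted action and clip to the valid range. The noise scale can be
    # annealed over training to shift from exploration to exploitation.
    noisy = prediction + sigma * torch.randn_like(prediction)
    return noisy.clamp(low, high)
```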