Abstract:
Deep reinforcement learning, and policy gradient methods in particular, have achieved remarkable success across various domains.
However, policy gradient methods still face challenges such as premature exploitation and the difficulty of selecting appropriate step sizes.
One effective strategy for mitigating these challenges is to impose trust regions on policy updates in the form of Kullback-Leibler divergence constraints.
Well-known methods such as Trust Region Policy Optimization and Proximal Policy Optimization adopt this approach, but they often rely on heuristics, exhibit implementation-dependent behavior, or lack scalability.
In response to these limitations, this thesis introduces a novel algorithm based on differentiable trust region projection layers.
This method offers a comprehensive and mathematically principled approach, ensuring efficiency, stability, and consistency for deep policy gradient methods.
Importantly, the proposed algorithm delivers results comparable or superior to those of existing methods while remaining agnostic to specific implementation choices and enforcing the trust region exactly for each state.
Moreover, it facilitates stable learning in high-dimensional and complex action spaces, making it particularly well suited for learning in trajectory space via movement primitives from classical robotics.
This integration combines the advantages of classical robotics, such as generating smooth, energy-efficient trajectories and handling sparse and non-Markovian rewards, with the scalability of deep reinforcement learning methods.
Additionally, we extend this method from the on-policy to the off-policy setting and eliminate the need for an explicit state-action value function while preserving learning stability.
This streamlines the learning process and improves the efficiency of exploration and exploitation in off-policy learning, especially in higher-dimensional action spaces.
All proposed algorithms are validated through extensive experiments on a variety of simulated tasks, including locomotion and manipulation.