Abstract:
Deep reinforcement learning, and policy gradient methods in particular, have achieved remarkable success across various domains.
However, policy gradient methods still face challenges such as premature exploitation and the difficulty of selecting appropriate step sizes.
One effective strategy for mitigating these challenges is to impose trust regions on policy updates in the form of Kullback-Leibler divergence constraints.
Well-known methods such as Trust Region Policy Optimization and Proximal Policy Optimization adopt this approach, but they often rely on heuristics, exhibit implementation-dependent behavior, or lack scalability.
In response to these limitations, this thesis introduces a novel algorithm based on differentiable trust region projection layers.
This method offers a comprehensive and mathematically principled approach, ensuring efficiency, stability, and consistency for deep policy gradient methods.
Importantly, the proposed algorithm delivers results comparable or superior to those of existing methods while remaining agnostic to specific implementation choices and enforcing the trust region exactly for each state.
Moreover, it facilitates stable learning in high-dimensional and complex action spaces, making it particularly well suited for learning in trajectory space via movement primitives from classical robotics.
This integration combines the advantages of classical robotics, such as generating smooth, energy-efficient trajectories and handling sparse and non-Markovian rewards, with the scalability of deep reinforcement learning methods.
Additionally, we extend this method from the on-policy to the off-policy setting and eliminate the need for an explicit state-action value function while preserving learning stability.
This streamlines the learning process and improves the efficiency of exploration and exploitation in off-policy learning, especially in higher-dimensional action spaces.
All proposed algorithms are validated through extensive experiments on a variety of simulated tasks, including locomotion and manipulation.