Towards Fine-grained 3D Human Motion Generation from Textual Instructions

Athanasiou, Nikolaos

dc.contributor.advisor	Black, Michael J. (Prof. Dr.)
dc.contributor.author	Athanasiou, Nikolaos
dc.date.accessioned	2026-04-23T14:30:41Z
dc.date.available	2026-04-23T14:30:41Z
dc.date.issued	2026-04-23
dc.identifier.uri	http://hdl.handle.net/10900/178573
dc.identifier.uri	http://nbn-resolving.org/urn:nbn:de:bsz:21-dspace-1785733	de_DE
dc.identifier.uri	http://dx.doi.org/10.15496/publikation-119897
dc.description.abstract	Controlling 3D human motion through natural language is key to unlocking interactive experiences in animation, gaming, virtual reality, and robotics. Yet current generative models still struggle with the compositional nature of real human behavior — actions unfold in sequence, overlap in time, and need fine-grained editing, all of which demand more than single-prompt, single-motion generation. This thesis argues that truly controllable motion generation requires compositional thinking: the ability to chain actions over time, layer them across body parts, and refine them through iterative editing. I present a suite of methods and datasets that tackle each of these axes. I first introduce TEACH, a hierarchical Transformer-based model that generates temporally coherent motion sequences from a series of textual descriptions, handling transitions between consecutive actions. I then address spatial composition with SINC, which synthesizes simultaneous actions — such as waving while walking — by leveraging structural knowledge about body-part involvement extracted from large language models. Moving from generation to editing, I present MotionFix, a dataset of source–target–edit-text triplets, together with TMED, a conditional diffusion model that modifies existing motions according to fine-grained textual instructions. Underpinning these contributions is BABEL, a large-scale dataset of semantically rich, frame-level annotations for motion-capture data that serves as a shared foundation for training and benchmarking across all three tasks. Together, these contributions advance language-driven motion generation from isolated actions toward the compositional, editable control that real-world applications demand.	en
dc.language.iso	en	de_DE
dc.publisher	Universität Tübingen	de_DE
dc.rights	ubt-podno	de_DE
dc.rights.uri	http://tobias-lib.uni-tuebingen.de/doku/lic_ohne_pod.php?la=de	de_DE
dc.rights.uri	http://tobias-lib.uni-tuebingen.de/doku/lic_ohne_pod.php?la=en	en
dc.subject.ddc	500	de_DE
dc.subject.other	3D	en
dc.subject.other	3D human motion	en
dc.subject.other	3D computer vision	en
dc.subject.other	3D humans	en
dc.title	Towards Fine-grained 3D Human Motion Generation from Textual Instructions	en
dc.type	PhDThesis	de_DE
dcterms.dateAccepted	2026-01-15
utue.publikation.fachbereich	Informatik	de_DE
utue.publikation.fakultaet	7 Mathematisch-Naturwissenschaftliche Fakultät	de_DE
utue.publikation.noppn	yes	de_DE

Files:	thesis_b5_orig_pdfa.pdf 13.8 MB PDF Description: Main Article (thesis ...

This document appears in:

7 Mathematisch-Naturwissenschaftliche Fakultät [5298]
