Towards Fine-grained 3D Human Motion Generation from Textual Instructions

DSpace Repository (Manakin-based)

dc.contributor.advisor Black, Michael J. (Prof. Dr.)
dc.contributor.author Athanasiou, Nikolaos
dc.date.accessioned 2026-04-23T14:30:41Z
dc.date.available 2026-04-23T14:30:41Z
dc.date.issued 2026-04-23
dc.identifier.uri http://hdl.handle.net/10900/178573
dc.identifier.uri http://nbn-resolving.org/urn:nbn:de:bsz:21-dspace-1785733 de_DE
dc.identifier.uri http://dx.doi.org/10.15496/publikation-119897
dc.description.abstract Controlling 3D human motion through natural language is key to unlocking interactive experiences in animation, gaming, virtual reality, and robotics. Yet current generative models still struggle with the compositional nature of real human behavior — actions unfold in sequence, overlap in time, and need fine-grained editing, all of which demand more than single-prompt, single-motion generation. This thesis argues that truly controllable motion generation requires compositional thinking: the ability to chain actions over time, layer them across body parts, and refine them through iterative editing. I present a suite of methods and datasets that tackle each of these axes. I first introduce TEACH, a hierarchical Transformer-based model that generates temporally coherent motion sequences from a series of textual descriptions, handling transitions between consecutive actions. I then address spatial composition with SINC, which synthesizes simultaneous actions — such as waving while walking — by leveraging structural knowledge about body-part involvement extracted from large language models. Moving from generation to editing, I present MotionFix, a dataset of source–target–edit-text triplets, together with TMED, a conditional diffusion model that modifies existing motions according to fine-grained textual instructions. Underpinning these contributions is BABEL, a large-scale dataset of semantically rich, frame-level annotations for motion-capture data that serves as a shared foundation for training and benchmarking across all three tasks. Together, these contributions advance language-driven motion generation from isolated actions toward the compositional, editable control that real-world applications demand. en
dc.language.iso en de_DE
dc.publisher Universität Tübingen de_DE
dc.rights ubt-podno de_DE
dc.rights.uri http://tobias-lib.uni-tuebingen.de/doku/lic_ohne_pod.php?la=de de_DE
dc.rights.uri http://tobias-lib.uni-tuebingen.de/doku/lic_ohne_pod.php?la=en en
dc.subject.ddc 500 de_DE
dc.subject.other 3D en
dc.subject.other 3D human motion en
dc.subject.other 3D computer vision en
dc.subject.other 3D humans en
dc.title Towards Fine-grained 3D Human Motion Generation from Textual Instructions en
dc.type PhDThesis de_DE
dcterms.dateAccepted 2026-01-15
utue.publikation.fachbereich Informatik de_DE
utue.publikation.fakultaet 7 Mathematisch-Naturwissenschaftliche Fakultät de_DE
utue.publikation.noppn yes de_DE
