TokenMotion: Decoupled Motion Control via Token Disentanglement for Human-centric Video Generation

Li, Ruineng; Xing, Daitao; Sun, Huiming; Ha, Yuanzhou; Shen, Jinglin; Ho, Chiuman

arXiv:2504.08181 (cs)

[Submitted on 11 Apr 2025]

Abstract: Human-centric motion control in video generation remains a critical challenge, particularly when jointly controlling camera movements and human poses in scenarios like the iconic Grammy Glambot moment. While recent video diffusion models have made significant progress, existing approaches struggle with limited motion representations and inadequate integration of camera and human motion controls. In this work, we present TokenMotion, the first DiT-based video diffusion framework that enables fine-grained control over camera motion, human motion, and their joint interaction. We represent camera trajectories and human poses as spatio-temporal tokens to enable local control granularity. Our approach introduces a unified modeling framework utilizing a decouple-and-fuse strategy, bridged by a human-aware dynamic mask that effectively handles the spatially-and-temporally varying nature of combined motion signals. Through extensive experiments, we demonstrate TokenMotion's effectiveness across both text-to-video and image-to-video paradigms, consistently outperforming current state-of-the-art methods in human-centric motion control tasks. Our work represents a significant advancement in controllable video generation, with particular relevance for creative production applications.
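
The abstract does not spell out how the human-aware dynamic mask combines the two control signals. As a purely illustrative sketch, and not the paper's actual formulation, one simple way a per-token mask could gate between camera tokens and pose tokens is a convex blend. The function name `fuse_tokens`, the soft mask in [0, 1], and the toy 2-dimensional embeddings are all assumptions introduced here.

```python
from typing import List

Token = List[float]


def fuse_tokens(camera: List[Token], pose: List[Token],
                human_mask: List[float]) -> List[Token]:
    """Blend two token streams position by position.

    Hypothetical helper: mask value 1.0 keeps the pose token,
    0.0 keeps the camera token, values in between interpolate.
    """
    if not (len(camera) == len(pose) == len(human_mask)):
        raise ValueError("all inputs must cover the same token positions")
    fused = []
    for cam_tok, pose_tok, m in zip(camera, pose, human_mask):
        if not 0.0 <= m <= 1.0:
            raise ValueError("mask values must lie in [0, 1]")
        # Convex combination per embedding dimension.
        fused.append([m * p + (1.0 - m) * c
                      for c, p in zip(cam_tok, pose_tok)])
    return fused


# Three token positions with toy 2-dim embeddings (assumed values).
camera_tokens = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
pose_tokens = [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
mask = [0.0, 0.5, 1.0]  # background, boundary, human region
fused = fuse_tokens(camera_tokens, pose_tokens, mask)
```

In this toy setup a background position carries only the camera signal, a human position carries only the pose signal, and a boundary position mixes the two. A mask that changes across frames would make the blend vary over time as well as space, which is the property the abstract attributes to the dynamic mask.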

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2504.08181 [cs.CV]
	(or arXiv:2504.08181v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2504.08181

Submission history

From: Ruineng Li
[v1] Fri, 11 Apr 2025 00:41:25 UTC (38,320 KB)
