HumanDiT: Pose-Guided Diffusion Transformer for Long-form Human Motion Video Generation

Gan, Qijun; Ren, Yi; Zhang, Chen; Ye, Zhenhui; Xie, Pan; Yin, Xiang; Yuan, Zehuan; Peng, Bingyue; Zhu, Jianke

Computer Science > Computer Vision and Pattern Recognition

arXiv:2502.04847 (cs)

[Submitted on 7 Feb 2025 (v1), last revised 10 Feb 2025 (this version, v2)]

Abstract: Human motion video generation has advanced significantly, but existing methods still struggle to render detailed body parts such as hands and faces accurately, especially in long sequences and intricate motions. Current approaches also rely on a fixed resolution and struggle to maintain visual consistency. To address these limitations, we propose HumanDiT, a pose-guided Diffusion Transformer (DiT)-based framework trained on a large, in-the-wild dataset containing 14,000 hours of high-quality video to produce high-fidelity videos with fine-grained body rendering. Specifically, (i) HumanDiT, built on DiT, supports numerous video resolutions and variable sequence lengths, facilitating learning for long-sequence video generation; (ii) we introduce a prefix-latent reference strategy to maintain personalized characteristics across extended sequences. Furthermore, during inference, HumanDiT leverages Keypoint-DiT to generate subsequent pose sequences, facilitating video continuation from static images or existing videos. It also uses a Pose Adapter to enable pose transfer with given sequences. Extensive experiments demonstrate its superior performance in generating long-form, pose-accurate videos across diverse scenarios.

Comments:	this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2502.04847 [cs.CV]
	(or arXiv:2502.04847v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2502.04847

Submission history

From: Qijun Gan
[v1] Fri, 7 Feb 2025 11:36:36 UTC (18,283 KB)
[v2] Mon, 10 Feb 2025 14:51:29 UTC (18,283 KB)
