MakeAnything: Harnessing Diffusion Transformers for Multi-Domain Procedural Sequence Generation

Song, Yiren; Liu, Cheng; Shou, Mike Zheng

Computer Science > Computer Vision and Pattern Recognition

arXiv:2502.01572 (cs)

[Submitted on 3 Feb 2025 (v1), last revised 5 Feb 2025 (this version, v2)]


Abstract: A hallmark of human intelligence is the ability to create complex artifacts through structured multi-step processes. Generating procedural tutorials with AI is a longstanding but challenging goal that faces three key obstacles: (1) the scarcity of multi-task procedural datasets, (2) the difficulty of maintaining logical continuity and visual consistency between steps, and (3) the difficulty of generalizing across multiple domains. To address these challenges, we propose a multi-domain dataset covering 21 tasks with over 24,000 procedural sequences. Building upon this foundation, we introduce MakeAnything, a framework based on the diffusion transformer (DIT), which leverages fine-tuning to activate the in-context capabilities of DIT for generating consistent procedural sequences. We introduce asymmetric low-rank adaptation (LoRA) for image generation, which balances generalization capabilities and task-specific performance by freezing encoder parameters while adaptively tuning decoder layers. Additionally, our ReCraft model enables image-to-process generation through spatiotemporal consistency constraints, allowing static images to be decomposed into plausible creation sequences. Extensive experiments demonstrate that MakeAnything surpasses existing methods, setting new performance benchmarks for procedural generation tasks.
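
The abstract does not spell out how the asymmetric LoRA is built, so the following is only a minimal sketch of one possible reading, not the authors' implementation. It assumes that the "encoder" corresponds to the LoRA down-projection, which is kept frozen, and the "decoder" to the up-projection, which is the only part that is trained. The class `AsymmetricLoRALinear`, the helpers `matmul` and `transpose`, the squared-error loss, and every parameter name are hypothetical and chosen for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


class AsymmetricLoRALinear:
    """Hypothetical layer: y = x W + scale * (x A) B.

    W (pretrained weight) and A (down-projection, the assumed "encoder")
    stay frozen; only B (up-projection, the assumed "decoder") is updated.
    """

    def __init__(self, weight, rank, scale=1.0, seed=0):
        rng = random.Random(seed)
        d_in, d_out = len(weight), len(weight[0])
        self.weight = weight  # frozen, shape (d_in, d_out)
        # Frozen random down-projection, shape (d_in, rank).
        self.down = [[rng.gauss(0.0, 1.0 / rank) for _ in range(rank)]
                     for _ in range(d_in)]
        # Trainable up-projection, shape (rank, d_out); zero init means the
        # adapted layer starts out identical to the pretrained layer.
        self.up = [[0.0] * d_out for _ in range(rank)]
        self.scale = scale

    def forward(self, x):
        base = matmul(x, self.weight)
        delta = matmul(matmul(x, self.down), self.up)
        return [[b + self.scale * d for b, d in zip(b_row, d_row)]
                for b_row, d_row in zip(base, delta)]

    def sgd_step(self, x, target, lr=0.01):
        """One SGD step on 0.5 * ||forward(x) - target||^2, touching only `up`.

        Returns the loss measured before the update.
        """
        out = self.forward(x)
        err = [[o - t for o, t in zip(o_row, t_row)]
               for o_row, t_row in zip(out, target)]
        hidden = matmul(x, self.down)
        grad_up = matmul(transpose(hidden), err)  # shape (rank, d_out)
        self.up = [[u - lr * self.scale * g for u, g in zip(u_row, g_row)]
                   for u_row, g_row in zip(self.up, grad_up)]
        return 0.5 * sum(e * e for row in err for e in row)
```

Under these assumptions, calling `sgd_step` repeatedly on a batch `x` changes only `layer.up`, while `layer.weight` and `layer.down` keep their initial values. In a real diffusion transformer the same idea would be applied per attention or MLP projection through a deep-learning framework rather than hand-written matrix code.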

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2502.01572 [cs.CV]
	(or arXiv:2502.01572v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2502.01572

Submission history

From: Yiren Song
[v1] Mon, 3 Feb 2025 17:55:30 UTC (28,579 KB)
[v2] Wed, 5 Feb 2025 02:44:42 UTC (28,579 KB)
