Multi-subject Open-set Personalization in Video Generation

Chen, Tsai-Shien; Siarohin, Aliaksandr; Menapace, Willi; Fang, Yuwei; Lee, Kwot Sin; Skorokhodov, Ivan; Aberman, Kfir; Zhu, Jun-Yan; Yang, Ming-Hsuan; Tulyakov, Sergey

arXiv:2501.06187 (cs)

[Submitted on 10 Jan 2025]

Abstract: Video personalization methods allow us to synthesize videos with specific concepts such as people, pets, and places. However, existing methods often focus on limited domains, require time-consuming optimization per subject, or support only a single subject. We present Video Alchemist, a video model with built-in multi-subject, open-set personalization capabilities for both foreground objects and background, eliminating the need for time-consuming test-time optimization. Our model is built on a new Diffusion Transformer module that fuses each conditional reference image and its corresponding subject-level text prompt with cross-attention layers. Developing such a large model presents two main challenges: dataset and evaluation. First, as paired datasets of reference images and videos are extremely hard to collect, we sample selected video frames as reference images and synthesize a clip of the target video. However, while models can easily denoise training videos given reference frames, they fail to generalize to new contexts. To mitigate this issue, we design a new automatic data construction pipeline with extensive image augmentations. Second, evaluating open-set video personalization is a challenge in itself. To address this, we introduce a personalization benchmark that focuses on accurate subject fidelity and supports diverse personalization scenarios. Finally, our extensive experiments show that our method significantly outperforms existing personalization methods in both quantitative and qualitative evaluations.
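
The abstract describes fusing each reference image with its subject-level text prompt through cross-attention. It gives no architectural details, so the following is only a toy sketch of that general idea, not the authors' module. It assumes, purely for illustration, that each image embedding is bound to its subject word embedding by element-wise addition and that attention is single-head with no learned projections. The names `build_condition_tokens` and `cross_attention` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def build_condition_tokens(image_embeds, text_embeds):
    """Pair each reference image embedding with its subject word embedding.

    Assumption: binding is done by element-wise addition. The abstract does
    not say how the pairing is implemented.
    """
    if len(image_embeds) != len(text_embeds):
        raise ValueError("need exactly one subject prompt per reference image")
    return [
        [i + t for i, t in zip(img, txt)]
        for img, txt in zip(image_embeds, text_embeds)
    ]


def cross_attention(video_tokens, condition_tokens):
    """Single-head cross-attention without learned projections.

    Queries are the video tokens; keys and values are both the condition
    tokens, so each output row is a convex combination of condition tokens.
    """
    scale = 1.0 / math.sqrt(len(condition_tokens[0]))
    fused = []
    for query in video_tokens:
        weights = softmax([dot(query, key) * scale for key in condition_tokens])
        fused.append([
            sum(w * value[d] for w, value in zip(weights, condition_tokens))
            for d in range(len(condition_tokens[0]))
        ])
    return fused


# Two subjects (for example a person and a dog), embedding size 2.
image_embeds = [[1.0, 0.0], [0.0, 1.0]]
text_embeds = [[0.5, 0.5], [0.5, 0.5]]
condition = build_condition_tokens(image_embeds, text_embeds)
video_tokens = [[2.0, 0.0], [0.0, 2.0]]
fused = cross_attention(video_tokens, condition)
```

In this toy setup, a video token aligned with the first subject's condition token receives a larger attention weight on that subject, which is the behavior one would want from a per-subject conditioning mechanism. A real Diffusion Transformer block would add learned query, key, and value projections, multiple heads, and residual connections.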

Comments:	Project page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2501.06187 [cs.CV]
	(or arXiv:2501.06187v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.06187

Submission history

From: Tsai-Shien Chen
[v1] Fri, 10 Jan 2025 18:59:54 UTC (28,179 KB)
