SnapGen-V: Generating a Five-Second Video within Five Seconds on a Mobile Device

Wu, Yushu; Zhang, Zhixing; Li, Yanyu; Xu, Yanwu; Kag, Anil; Sui, Yang; Coskun, Huseyin; Ma, Ke; Lebedev, Aleksei; Hu, Ju; Metaxas, Dimitris; Wang, Yanzhi; Tulyakov, Sergey; Ren, Jian

arXiv:2412.10494 (cs)

[Submitted on 13 Dec 2024]

Abstract: We have witnessed the unprecedented success of diffusion-based video generation over the past year. Recently proposed models from the community can generate cinematic, high-resolution videos with smooth motion from arbitrary input prompts. However, as a supertask of image generation, video generation requires more computation, so these models are mostly hosted on cloud servers, which limits broader adoption among content creators. In this work, we propose a comprehensive acceleration framework to bring the power of large-scale video diffusion models to edge users. On the network architecture side, we initialize from a compact image backbone and search for the design and arrangement of temporal layers that maximizes hardware efficiency. In addition, we propose a dedicated adversarial fine-tuning algorithm for our efficient model and reduce the number of denoising steps to 4. Our model, with only 0.6B parameters, can generate a 5-second video on an iPhone 16 Pro Max within 5 seconds. Compared to server-side models that take minutes on powerful GPUs to generate a single video, we accelerate generation by orders of magnitude while delivering on-par quality.
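To make the "4 denoising steps" figure concrete, the sketch below shows a generic deterministic few-step sampling loop in Python. It is an illustration only: the abstract does not specify the sampler, the noise schedule, or the network interface, so `toy_denoiser`, `sample_few_step`, and the `sigmas` values are all hypothetical stand-ins rather than the authors' method. The point it shows is that the number of network evaluations equals the number of steps, which is why cutting the step count to 4 directly cuts generation time.

```python
import random


def toy_denoiser(x, sigma):
    """Hypothetical stand-in for a trained network (assumption, not from the paper).

    Returns an estimate of the clean sample given noisy input x at noise level sigma.
    Here it just shrinks x toward zero so the example runs without any model.
    """
    return [xi / (1.0 + sigma * sigma) for xi in x]


def sample_few_step(denoiser, size, sigmas, seed=0):
    """Generic deterministic few-step sampler (Euler steps on the noise level).

    sigmas is a decreasing list of noise levels ending at 0.0, so
    len(sigmas) - 1 is the number of denoising steps.
    """
    rng = random.Random(seed)
    # Start from pure Gaussian noise at the highest noise level.
    x = [rng.gauss(0.0, sigmas[0]) for _ in range(size)]
    calls = 0
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        x0 = denoiser(x, sigma)  # one network evaluation per step
        calls += 1
        # Move toward the clean estimate, keeping a fraction of the residual
        # noise proportional to the next noise level.
        x = [x0i + (sigma_next / sigma) * (xi - x0i) for xi, x0i in zip(x, x0)]
    return x, calls


# Five noise levels give four steps; the values are arbitrary illustrative choices.
sigmas = [14.6, 4.0, 1.0, 0.3, 0.0]
latent, n_calls = sample_few_step(toy_denoiser, size=8, sigmas=sigmas)
print(n_calls)
```

In a real pipeline `toy_denoiser` would be replaced by the video diffusion network, and the total latency would also include text encoding and latent decoding, which this sketch leaves out.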

Comments:	this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
Cite as:	arXiv:2412.10494 [cs.CV]
	(or arXiv:2412.10494v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2412.10494

Submission history

From: Yanyu Li [view email]
[v1] Fri, 13 Dec 2024 18:59:56 UTC (21,250 KB)
