ModelGrow: Continual Text-to-Video Pre-training with Model Expansion and Language Understanding Enhancement

Rao, Zhefan; Ji, Liya; Xing, Yazhou; Liu, Runtao; Liu, Zhaoyang; Xie, Jiaxin; Peng, Ziqiao; He, Yingqing; Chen, Qifeng

arXiv:2412.18966 (cs)

[Submitted on 25 Dec 2024]

Abstract: Text-to-video (T2V) generation has gained significant attention recently. However, the costs of training a T2V model from scratch remain persistently high, and there is considerable room for improving the generation performance, especially under limited computation resources. This work explores the continual general pre-training of text-to-video models, enabling the model to "grow" its abilities based on a pre-trained foundation, analogous to how humans acquire new knowledge based on past experiences. There is a lack of extensive study of the continual pre-training techniques in T2V generation. In this work, we take the initial step toward exploring this task systematically and propose ModelGrow. Specifically, we break this task into two key aspects: increasing model capacity and improving semantic understanding. For model capacity, we introduce several novel techniques to expand the model size, enabling it to store new knowledge and improve generation performance. For semantic understanding, we propose a method that leverages large language models as advanced text encoders, integrating them into T2V models to enhance language comprehension and guide generation results according to detailed prompts. This approach enables the model to achieve better semantic alignment, particularly in response to complex user prompts. Extensive experiments demonstrate the effectiveness of our method across various metrics. The source code and the model of ModelGrow will be publicly available.

Comments:	18 pages
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2412.18966 [cs.CV]
	(or arXiv:2412.18966v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2412.18966

Submission history

From: Zhefan Rao
[v1] Wed, 25 Dec 2024 18:58:07 UTC (44,595 KB)
