VidGen-1M: A Large-Scale Dataset for Text-to-video Generation

Tan, Zhiyu; Yang, Xiaomeng; Qin, Luozheng; Li, Hao

Computer Science > Computer Vision and Pattern Recognition

arXiv:2408.02629 (cs)

[Submitted on 5 Aug 2024]

Abstract: The quality of video-text pairs fundamentally determines the upper bound of text-to-video models. Currently, the datasets used for training these models suffer from significant shortcomings, including low temporal consistency, poor-quality captions, substandard video quality, and imbalanced data distribution. The prevailing video curation process, which depends on image models for tagging and manual rule-based curation, leads to a high computational load and leaves behind unclean data. As a result, there is a lack of appropriate training datasets for text-to-video models. To address this problem, we present VidGen-1M, a superior training dataset for text-to-video models. Produced through a coarse-to-fine curation strategy, this dataset guarantees high-quality videos and detailed captions with excellent temporal consistency. When used to train the video generation model, this dataset has led to experimental results that surpass those obtained with other models.

Comments:	project page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2408.02629 [cs.CV]
	(or arXiv:2408.02629v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2408.02629

Submission history

From: Luozheng Qin
[v1] Mon, 5 Aug 2024 16:53:23 UTC (7,832 KB)
