Text-to-Edit: Controllable End-to-End Video Ad Creation via Multimodal LLMs

Cheng, Dabing; Zhan, Haosen; Zhao, Xingchen; Liu, Guisheng; Li, Zemin; Xie, Jinghui; Song, Zhao; Feng, Weiguo; Peng, Bingyue

arXiv:2501.05884 (cs)

[Submitted on 10 Jan 2025]

Abstract: The rapid growth of short-video content has driven demand for efficient, automated video editing. The main challenges are understanding the video content and tailoring the edit to user requirements. To address this need, we propose an end-to-end foundational framework that gives precise control over the final edited video. Leveraging the flexibility and generalizability of Multimodal Large Language Models (MLLMs), we define clear input-output mappings for efficient video creation. To strengthen the model's ability to process and comprehend video content, we combine a denser frame rate with a slow-fast processing technique, which significantly improves the extraction and understanding of both temporal and spatial video information. Furthermore, we introduce a text-to-edit mechanism that lets users achieve desired video outcomes through textual input, improving the quality and controllability of the edited videos. Comprehensive experiments show that our method is significantly effective on advertising datasets and also yields generally applicable conclusions on public datasets.
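
The abstract does not say how the denser frame rate and the slow-fast technique are combined. The sketch below illustrates one common reading of the slow-fast idea, not the authors' implementation: a "fast" pathway samples frames densely but keeps few visual tokens per frame, while a "slow" pathway samples sparsely and keeps many. The function names, strides, and token counts are all hypothetical.

```python
def slow_fast_indices(num_frames, fast_stride, slow_stride):
    """Return (slow, fast) lists of frame indices.

    Assumption: both pathways sample uniformly, and the fast pathway
    uses the smaller stride (the denser frame rate).
    """
    if num_frames < 0 or fast_stride <= 0 or slow_stride <= 0:
        raise ValueError("num_frames must be >= 0 and strides must be > 0")
    fast = list(range(0, num_frames, fast_stride))
    slow = list(range(0, num_frames, slow_stride))
    return slow, fast


def visual_token_budget(slow, fast, slow_tokens_per_frame, fast_tokens_per_frame):
    """Total visual tokens passed to the language model.

    Assumption: slow frames keep many tokens each (spatial detail),
    fast frames keep few tokens each (temporal coverage).
    """
    return len(slow) * slow_tokens_per_frame + len(fast) * fast_tokens_per_frame


if __name__ == "__main__":
    slow, fast = slow_fast_indices(num_frames=16, fast_stride=2, slow_stride=8)
    print("slow frames:", slow)
    print("fast frames:", fast)
    print("token budget:", visual_token_budget(slow, fast, 256, 16))
```

Under these assumed settings, the dense fast pathway covers four times as many frames as the slow pathway while contributing fewer tokens in total. That trade-off is what lets a denser frame rate fit within a fixed context length.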

Comments:	16 pages, conference
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2501.05884 [cs.CV]
	(or arXiv:2501.05884v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.05884

Submission history

From: Dabing Cheng
[v1] Fri, 10 Jan 2025 11:35:43 UTC (15,440 KB)
