Movie2Story: A framework for understanding videos and telling stories in the form of novel text

Li, Kangning; Jia, Zheyang; Ying, Anyu


arXiv:2412.14965 (cs)

[Submitted on 19 Dec 2024]



Abstract: Multimodal video-to-text models have made considerable progress, primarily in generating brief descriptions of video content. However, there is still a deficiency in generating rich long-form text descriptions that integrate both video and audio. In this paper, we introduce a framework called M2S, designed to generate novel-length text by combining audio, video, and character recognition. M2S includes modules for video long-form text description and comprehension, audio-based analysis of emotion, speech rate, and character alignment, and visual-based character recognition alignment. By integrating multimodal information using the large language model GPT-4o, M2S stands out in the field of multimodal text generation. We demonstrate the effectiveness and accuracy of M2S through comparative experiments and human evaluation. Additionally, the model framework has good scalability and significant potential for future research.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2412.14965 [cs.CV]
	(or arXiv:2412.14965v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2412.14965

Submission history

From: Kangning Li
[v1] Thu, 19 Dec 2024 15:44:04 UTC (188 KB)
