LongViTU: Instruction Tuning for Long-Form Video Understanding

Wu, Rujie; Ma, Xiaojian; Ci, Hai; Fan, Yue; Wang, Yuxuan; Zhao, Haozhe; Li, Qing; Wang, Yizhou

arXiv:2501.05037 (cs)

[Submitted on 9 Jan 2025]

Abstract: This paper introduces LongViTU, a large-scale (~121k QA pairs, ~900h of video), automatically generated dataset for long-form video understanding. We develop a systematic approach that organizes videos into a hierarchical tree structure and incorporates self-revision mechanisms to ensure high-quality QA pairs. Each QA pair in LongViTU features: 1) long-term context (average certificate length of 4.6 minutes); 2) rich knowledge and condensed reasoning (commonsense, causality, planning, etc.); and 3) explicit timestamp labels for relevant events. LongViTU also serves as a benchmark for instruction following in long-form and streaming video understanding. We evaluate the open-source state-of-the-art long video understanding model, LongVU, and the commercial model, Gemini-1.5-Pro, on our benchmark. They achieve GPT-4 scores of 49.9 and 52.3, respectively, underscoring the substantial challenge posed by our benchmark. Further supervised fine-tuning (SFT) of LongVU led to performance improvements of 12.0% on our benchmark, 2.2% on the in-distribution (ID) benchmark EgoSchema, and 1.0%, 2.2%, and 1.2% on the out-of-distribution (OOD) benchmarks VideoMME (Long), WorldQA, and OpenEQA, respectively. These outcomes demonstrate LongViTU's high data quality and robust OOD generalizability.
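
To make the "explicit timestamp labels" and "average certificate length" figures concrete, the following is a minimal illustrative sketch, not the dataset's actual schema or the authors' code. It assumes a hypothetical `QAPair` record whose `evidence` field holds (start, end) timestamps in seconds, and it assumes that certificate length can be approximated as the total duration covered by those spans, with overlaps counted once. All names here are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class QAPair:
    """Hypothetical record layout; the released dataset may differ."""
    question: str
    answer: str
    # Assumed: (start, end) timestamps in seconds for the relevant events.
    evidence: List[Tuple[float, float]]


def certificate_length(spans: List[Tuple[float, float]]) -> float:
    """Total seconds covered by the spans, counting overlaps only once."""
    total = 0.0
    cur_start = cur_end = None
    for start, end in sorted(spans):
        if cur_end is None or start > cur_end:
            # The previous merged span is finished; add its duration.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or touching span: extend the current one.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def mean_certificate_minutes(pairs: List[QAPair]) -> float:
    """Average certificate length over a set of QA pairs, in minutes."""
    if not pairs:
        return 0.0
    seconds = sum(certificate_length(p.evidence) for p in pairs)
    return seconds / len(pairs) / 60.0


# Toy data, invented for this example only.
pairs = [
    QAPair("Why did the person open the drawer?", "To get a knife.",
           [(0.0, 60.0), (30.0, 120.0), (200.0, 260.0)]),
    QAPair("What will the person do next?", "Wash the pan.",
           [(10.0, 310.0)]),
]
mean_minutes = mean_certificate_minutes(pairs)
```

Under these assumptions, a dataset-level statistic such as the reported 4.6-minute average would correspond to `mean_certificate_minutes` evaluated over all QA pairs.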

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2501.05037 [cs.CV]
	(or arXiv:2501.05037v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.05037

Submission history

From: Rujie Wu
[v1] Thu, 9 Jan 2025 07:51:14 UTC (6,065 KB)
