InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding

Ataallah, Kirolos; Gou, Chenhui; Abdelrahman, Eslam; Pahwa, Khushbu; Ding, Jian; Elhoseiny, Mohamed

arXiv:2406.19875 (cs)

[Submitted on 28 Jun 2024 (v1), last revised 31 Aug 2024 (this version, v2)]

Abstract: Understanding long videos, ranging from tens of minutes to several hours, presents unique challenges in video comprehension. Despite the increasing importance of long-form video content, existing benchmarks primarily focus on shorter clips. To address this gap, we introduce InfiniBench, a comprehensive benchmark for very long video understanding that offers: (1) the longest video duration, averaging 52.59 minutes per video; (2) the largest number of question-answer pairs, 108.2K; (3) diverse questions that examine nine different skills and include both multiple-choice and open-ended formats; and (4) a human-centric design, with video sources drawn from movies and daily TV shows and human-level question designs such as Movie Spoiler Questions that require critical thinking and comprehensive understanding. Using InfiniBench, we comprehensively evaluate existing Large Multi-Modality Models (LMMs) on each skill, including commercial models such as GPT-4o and Gemini 1.5 Flash as well as open-source models. The evaluation shows that our benchmark poses significant challenges: even leading AI models like GPT-4o and Gemini 1.5 Flash struggle to achieve high performance in long video understanding, with average accuracies of just 49.16% and 42.72%, and average scores of 3.22 and 2.71 out of 5, respectively. We hope this benchmark will stimulate the LMM community towards long video and human-level understanding. Our benchmark can be accessed at this https URL

Comments:	24 pages, 25 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2406.19875 [cs.CV]
	(or arXiv:2406.19875v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2406.19875

Submission history

From: Kirolos Ataallah
[v1] Fri, 28 Jun 2024 12:35:01 UTC (2,869 KB)
[v2] Sat, 31 Aug 2024 10:34:37 UTC (3,785 KB)
