Beyond Single Frames: Can LMMs Comprehend Temporal and Contextual Narratives in Image Sequences?

Wang, Xiaochen; Xia, Heming; Song, Jialin; Guan, Longyu; Yang, Yixin; Dong, Qingxiu; Luo, Weiyao; Pu, Yifan; Wang, Yiru; Meng, Xiangdi; Li, Wenjie; Sui, Zhifang

arXiv:2502.13925 (cs)

[Submitted on 19 Feb 2025]

Abstract: Large Multimodal Models (LMMs) have achieved remarkable success across various visual-language tasks. However, existing benchmarks predominantly focus on single-image understanding, leaving the analysis of image sequences largely unexplored. To address this limitation, we introduce StripCipher, a comprehensive benchmark designed to evaluate the capabilities of LMMs to comprehend and reason over sequential images. StripCipher comprises a human-annotated dataset and three challenging subtasks: visual narrative comprehension, contextual frame prediction, and temporal narrative reordering. Our evaluation of 16 state-of-the-art LMMs, including GPT-4o and Qwen2.5VL, reveals a significant performance gap compared to human capabilities, particularly in tasks that require reordering shuffled sequential images. For instance, GPT-4o achieves only 23.93% accuracy in the reordering subtask, which is 56.07 percentage points lower than human performance. Further quantitative analysis discusses several factors, such as the input format of images, that affect the performance of LMMs in sequential understanding, underscoring the fundamental challenges that remain in the development of LMMs.

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2502.13925 [cs.CL]
	(or arXiv:2502.13925v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2502.13925

Submission history

From: Xiaochen Wang
[v1] Wed, 19 Feb 2025 18:04:44 UTC (1,916 KB)
