Exploring the Role of Explicit Temporal Modeling in Multimodal Large Language Models for Video Understanding

Li, Yun; Liu, Zhe; Kong, Yajing; Li, Guangrui; Zhang, Jiyuan; Bian, Chao; Liu, Feng; Yao, Lina; Sun, Zhenbang

arXiv:2501.16786 (cs)

[Submitted on 28 Jan 2025]

Abstract: Applying Multimodal Large Language Models (MLLMs) to video understanding presents significant challenges due to the need to model temporal relations across frames. Existing approaches adopt either implicit temporal modeling, relying solely on the LLM decoder, or explicit temporal modeling, employing auxiliary temporal encoders. To investigate this debate between the two paradigms, we propose the Stackable Temporal Encoder (STE). STE enables flexible explicit temporal modeling with adjustable temporal receptive fields and token compression ratios. Using STE, we systematically compare implicit and explicit temporal modeling across dimensions such as overall performance, token compression effectiveness, and temporal-specific understanding. We also explore STE's design considerations and broader impacts as a plug-in module and in image modalities. Our findings emphasize the critical role of explicit temporal modeling, providing actionable insights to advance video MLLMs.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2501.16786 [cs.CV]
	(or arXiv:2501.16786v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.16786

Submission history

From: Yun Li
[v1] Tue, 28 Jan 2025 08:30:58 UTC (39,475 KB)
