Memory-efficient Streaming VideoLLMs for Real-time Procedural Video Understanding

Chatterjee, Dibyadip; Remelli, Edoardo; Song, Yale; Tekin, Bugra; Mittal, Abhay; Bhatnagar, Bharat; Camgöz, Necati Cihan; Hampali, Shreyas; Sauser, Eric; Ma, Shugao; Yao, Angela; Sener, Fadime

arXiv:2504.13915 (cs)

[Submitted on 10 Apr 2025]

Abstract: We introduce ProVideLLM, an end-to-end framework for real-time procedural video understanding. ProVideLLM integrates a multimodal cache configured to store two types of tokens: verbalized text tokens, which provide compressed textual summaries of long-term observations, and visual tokens, encoded with DETR-QFormer to capture fine-grained details from short-term observations. This design reduces the token count by 22x over existing methods in representing one hour of long-term observations while still encoding the present at fine granularity. By interleaving these tokens in our multimodal cache, ProVideLLM ensures sub-linear scaling of memory and compute with video length, enabling per-frame streaming inference at 10 FPS and streaming dialogue at 25 FPS, with a minimal 2 GB GPU memory footprint. ProVideLLM also sets new state-of-the-art results on six procedural tasks across four datasets.
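
The two-tier cache described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class name `ToyMultimodalCache`, the parameters `window_frames` and `segment_frames`, the fixed-size eviction policy, and the stand-in `fake_verbalize` function are all assumptions made for illustration, and the DETR-QFormer encoder and the language model are not modeled at all. The sketch only shows the bookkeeping idea: recent frames stay as visual tokens, while older frames are replaced by a much shorter list of text tokens.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List


@dataclass
class ToyMultimodalCache:
    """Toy cache: recent frames stay visual, older frames become text."""

    window_frames: int    # hypothetical: frames always kept as visual tokens
    segment_frames: int   # hypothetical: frames verbalized per eviction
    verbalize: Callable[[List[List[int]]], List[str]]  # hypothetical summarizer
    text_tokens: List[str] = field(default_factory=list)
    visual_frames: Deque[List[int]] = field(default_factory=deque)

    def add_frame(self, frame_tokens: List[int]) -> None:
        """Append one frame's visual tokens; verbalize the oldest segment if full."""
        self.visual_frames.append(frame_tokens)
        if len(self.visual_frames) >= self.window_frames + self.segment_frames:
            # Pop the oldest frames and replace them with a short text summary.
            segment = [self.visual_frames.popleft()
                       for _ in range(self.segment_frames)]
            self.text_tokens.extend(self.verbalize(segment))

    def num_tokens(self) -> int:
        """Total tokens currently held: text summaries plus visual tokens."""
        return len(self.text_tokens) + sum(len(f) for f in self.visual_frames)


def fake_verbalize(segment: List[List[int]]) -> List[str]:
    """Stand-in summarizer (assumption): two text tokens per segment."""
    return ["step", "summary"]


cache = ToyMultimodalCache(window_frames=4, segment_frames=4,
                           verbalize=fake_verbalize)
frames_seen = 100
tokens_per_frame = 8  # assumed visual tokens per frame
for i in range(frames_seen):
    cache.add_frame([i] * tokens_per_frame)

naive_tokens = frames_seen * tokens_per_frame  # cost of keeping every frame
print(cache.num_tokens(), naive_tokens)
```

With these assumed settings the cache holds 80 tokens after 100 frames, against 800 if every frame were kept as visual tokens. The toy version still grows linearly in the number of verbalized segments, only with a much smaller constant, so it does not reproduce the sub-linear scaling or the 22x reduction reported in the abstract.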

Comments:	13 pages, 5 figures; this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2504.13915 [cs.CV]
	(or arXiv:2504.13915v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2504.13915

Submission history

From: Dibyadip Chatterjee
[v1] Thu, 10 Apr 2025 17:13:08 UTC (7,095 KB)
