Streaming Video Question-Answering with In-context Video KV-Cache Retrieval

Di, Shangzhe; Yu, Zhelun; Zhang, Guanghao; Li, Haoyuan; Zhong, Tao; Cheng, Hao; Li, Bolin; He, Wanggui; Shu, Fangxun; Jiang, Hao

arXiv:2503.00540 (cs)

[Submitted on 1 Mar 2025]

Abstract: We propose ReKV, a novel training-free approach that enables efficient streaming video question-answering (StreamingVQA) by seamlessly integrating with existing Video Large Language Models (Video-LLMs). Traditional VideoQA systems struggle with long videos, as they must process entire videos before responding to queries, and repeat this process for each new question. In contrast, our approach analyzes long videos in a streaming manner, allowing for prompt responses as soon as user queries are received. Building on a common Video-LLM, we first incorporate a sliding-window attention mechanism, ensuring that input frames attend to a limited number of preceding frames, thereby reducing computational overhead. To prevent information loss, we store processed video key-value caches (KV-Caches) in RAM and on disk, reloading them into GPU memory as needed. Additionally, we introduce a retrieval method that leverages an external retriever or the parameters within Video-LLMs to retrieve only query-relevant KV-Caches, ensuring both efficiency and accuracy in question answering. ReKV enables the separation of video encoding and question-answering across different processes and GPUs, significantly enhancing the efficiency of StreamingVQA. Through comprehensive experimentation, we validate the efficacy and practicality of our approach, which significantly boosts efficiency and enhances applicability over existing VideoQA models.
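The two mechanisms named in the abstract, sliding-window attention over incoming frames and retrieval of query-relevant cached frames, can be illustrated with a toy sketch. This is not the authors' implementation. The function names, the use of one summary vector per frame, cosine similarity as the relevance score, and returning hits in temporal order are all assumptions made for illustration; the abstract specifies none of them.

```python
import math
from typing import List, Sequence


def sliding_window_visible(num_frames: int, window: int) -> List[List[int]]:
    """For each frame i, list the frame indices it may attend to:
    itself plus at most (window - 1) preceding frames."""
    return [list(range(max(0, i - window + 1), i + 1)) for i in range(num_frames)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_vec: Sequence[float],
                   frame_vecs: Sequence[Sequence[float]],
                   k: int) -> List[int]:
    """Return indices of the k cached frames most similar to the query.

    Hypothetical helper: each cached frame is represented by a single
    summary vector (an assumption). Indices are returned in temporal
    order so the selected cache entries can be reloaded in sequence.
    """
    ranked = sorted(range(len(frame_vecs)),
                    key=lambda i: cosine(query_vec, frame_vecs[i]),
                    reverse=True)
    return sorted(ranked[:k])


# Each frame sees only itself and one predecessor when window == 2.
visible = sliding_window_visible(num_frames=4, window=2)

# Toy per-frame summaries; frames 0 and 2 point roughly along the query.
cached = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
selected = retrieve_top_k([1.0, 0.0], cached, k=2)
```

In a real system the selected indices would identify which offloaded KV-Cache entries to copy back into GPU memory before answering; here they are plain list positions.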

Comments:	Accepted to ICLR 2025. Code: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2503.00540 [cs.CV]
	(or arXiv:2503.00540v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.00540

Submission history

From: Shangzhe Di
[v1] Sat, 1 Mar 2025 15:53:33 UTC (9,117 KB)
