MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference

Wan, Zhongwei; Shen, Hui; Wang, Xin; Liu, Che; Mai, Zheda; Zhang, Mi

arXiv:2502.17599 (cs)

[Submitted on 24 Feb 2025]

Abstract: Long-context Multimodal Large Language Models (MLLMs) that incorporate long text-image and text-video modalities demand substantial resources because their multimodal Key-Value (KV) caches grow with input length, which challenges inference efficiency. Existing KV cache compression methods, for both text-only and multimodal LLMs, have neglected variations in attention density across layers and therefore often adopt uniform or progressive reduction strategies for layer-wise cache allocation. In this work, we propose MEDA, a dynamic layer-wise KV cache allocation method for efficient multimodal long-context inference. At its core, MEDA uses cross-modal attention entropy to determine the KV cache size at each MLLM layer. Given the dynamically allocated KV cache size at each layer, MEDA also employs a KV pair selection scheme to identify which KV pairs to select and a KV pair merging strategy that merges the selected and non-selected pairs to preserve information from the entire context. MEDA achieves up to 72% KV cache memory reduction and 2.82 times faster decoding, while maintaining or enhancing performance on various multimodal tasks in long-context settings, including multi-image and long-video scenarios. Our code is released at this https URL.
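
To make the idea of entropy-driven, layer-wise allocation concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not give the formula, so the sketch assumes plain Shannon entropy over attention rows as a stand-in for cross-modal attention entropy, and assumes that layers with higher entropy (more diffuse attention) receive a proportionally larger share of a fixed total budget. The function names `attention_entropy` and `allocate_budgets` and the parameter `min_per_layer` are hypothetical.

```python
import math


def attention_entropy(rows):
    """Mean Shannon entropy (in nats) over attention rows.

    Each row is assumed to be a probability distribution (sums to 1).
    ASSUMPTION: stands in for the paper's cross-modal attention entropy.
    """
    total = 0.0
    for row in rows:
        total += -sum(p * math.log(p) for p in row if p > 0.0)
    return total / len(rows)


def allocate_budgets(layer_entropies, total_budget, min_per_layer=1):
    """Split total_budget KV slots across layers in proportion to entropy.

    ASSUMPTION: higher entropy -> larger cache; the proportional rule is
    illustrative only and is not taken from the paper.
    """
    n = len(layer_entropies)
    if total_budget < n * min_per_layer:
        raise ValueError("total_budget too small for the per-layer minimum")

    spare = total_budget - n * min_per_layer
    z = sum(layer_entropies)
    if z == 0:
        shares = [spare / n] * n  # no signal: fall back to a uniform split
    else:
        shares = [spare * h / z for h in layer_entropies]

    # Floor each share, then hand the leftover slots to the layers
    # with the largest fractional parts so the budgets sum exactly.
    budgets = [min_per_layer + int(s) for s in shares]
    leftover = total_budget - sum(budgets)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]),
                   reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


if __name__ == "__main__":
    peaked = [[0.97, 0.01, 0.01, 0.01]]   # attention focused on one token
    diffuse = [[0.25, 0.25, 0.25, 0.25]]  # attention spread evenly
    entropies = [attention_entropy(peaked), attention_entropy(diffuse)]
    print(allocate_budgets(entropies, total_budget=64))
```

Under these assumptions, the layer with diffuse attention ends up with the larger cache, in contrast to a uniform split that would give each layer the same number of slots. The selection and merging steps described in the abstract would then operate within each layer's budget and are not sketched here.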

Comments:	NAACL 2025 Main
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2502.17599 [cs.CL]
	(or arXiv:2502.17599v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2502.17599

Submission history

From: Zhongwei Wan
[v1] Mon, 24 Feb 2025 19:34:52 UTC (9,881 KB)
