Marconi: Prefix Caching for the Era of Hybrid LLMs

Pan, Rui; Wang, Zhuang; Jia, Zhen; Karakus, Can; Zancato, Luca; Dao, Tri; Wang, Yida; Netravali, Ravi

arXiv:2411.19379 (cs)

[Submitted on 28 Nov 2024 (v1), last revised 4 Dec 2024 (this version, v2)]

Abstract: Hybrid models that combine the language modeling capabilities of Attention layers with the efficiency of Recurrent layers (e.g., State Space Models) have gained traction in practically supporting long contexts in Large Language Model serving. Yet, the unique properties of these models complicate the use of complementary efficiency optimizations such as prefix caching, which skips redundant computations across requests. Most notably, their use of in-place state updates for recurrent layers precludes rolling back cache entries for partial sequence overlaps, and instead mandates only exact-match cache hits; the effect is a deluge of (large) cache entries per sequence, most of which yield minimal reuse opportunities. We present Marconi, the first system that supports efficient prefix caching with Hybrid LLMs. Key to Marconi are its novel admission and eviction policies that more judiciously assess potential cache entries based not only on recency, but also on (1) forecasts of their reuse likelihood across a taxonomy of different hit scenarios, and (2) the compute savings that hits deliver relative to memory footprints. Across diverse workloads and Hybrid models, Marconi achieves up to 34.4$\times$ higher token hit rates (71.1% or 617 ms lower time to first token, TTFT) compared to state-of-the-art prefix caching systems.
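To make the exact-match constraint concrete, the following is a minimal illustrative sketch and not Marconi's implementation. It assumes token sequences are lists of integers, that per-token Attention KV entries can be truncated to any shared prefix, and that a recurrent state is reusable only at positions where a state was explicitly checkpointed. All function names, the `checkpoints` parameter, and the `utility` score (recency plus a weighted compute-savings-per-byte term, with a hypothetical weight `alpha`) are assumptions introduced for illustration; the abstract does not specify the actual scoring formula.

```python
from typing import Sequence


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens that two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def attention_reuse(cached: Sequence[int], request: Sequence[int]) -> int:
    """KV entries are stored per token, so any shared prefix can be reused."""
    return common_prefix_len(cached, request)


def recurrent_reuse(cached: Sequence[int],
                    checkpoints: Sequence[int],
                    request: Sequence[int]) -> int:
    """A recurrent state is updated in place and cannot be rolled back.

    It is reusable only if it was checkpointed at a position that lies
    entirely inside the prefix shared with the new request.
    `checkpoints` (hypothetical) lists the token counts at which a state
    was saved for `cached`.
    """
    shared = common_prefix_len(cached, request)
    usable = [p for p in checkpoints if p <= shared]
    return max(usable, default=0)


def hybrid_reuse(cached: Sequence[int],
                 checkpoints: Sequence[int],
                 request: Sequence[int]) -> int:
    """A hybrid model needs both kinds of state, so the stricter limit wins."""
    return min(attention_reuse(cached, request),
               recurrent_reuse(cached, checkpoints, request))


def utility(recency: float, flops_saved: float, bytes_used: float,
            alpha: float = 1.0) -> float:
    """Hypothetical eviction score: recency plus compute saved per byte.

    This form is an assumption for illustration, not the paper's formula.
    """
    return recency + alpha * flops_saved / bytes_used


cached = [1, 2, 3, 4, 5]
request = [1, 2, 3, 9]
print(attention_reuse(cached, request))          # shared prefix of 3 tokens
print(hybrid_reuse(cached, [5], request))        # state saved only at the end
print(hybrid_reuse(cached, [2, 5], request))     # extra checkpoint at 2 tokens
```

With a state saved only at the end of the cached sequence, the partially overlapping request reuses nothing, even though three tokens match. Saving more checkpoints recovers some reuse at the cost of more (large) cache entries, which is the tension that admission and eviction policies have to manage.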

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2411.19379 [cs.DC]
	(or arXiv:2411.19379v2 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2411.19379

Submission history

From: Rui Pan
[v1] Thu, 28 Nov 2024 21:10:20 UTC (1,767 KB)
[v2] Wed, 4 Dec 2024 18:40:24 UTC (1,767 KB)
