Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation

Agarwal, Shubham; Sundaresan, Sai; Mitra, Subrata; Mahapatra, Debabrata; Gupta, Archit; Sharma, Rounak; Kapu, Nirmal Joshua; Yu, Tong; Saini, Shiv

arXiv:2502.15734 (cs)

[Submitted on 5 Feb 2025]

Abstract: Retrieval-Augmented Generation (RAG) is often used with Large Language Models (LLMs) to infuse domain knowledge or user-specific information. In RAG, given a user query, a retriever extracts chunks of relevant text from a knowledge base, and these chunks are sent to an LLM as part of the input prompt. Typically, any given chunk is retrieved repeatedly across user questions. Currently, however, the attention layers in LLMs fully recompute the keys and values (KVs) of the input chunks for every question, because state-of-the-art methods cannot reuse KV-caches when chunks appear at arbitrary locations with arbitrary contexts, and naive reuse degrades output quality. This recomputation is potentially redundant work on expensive GPUs and increases latency. In this work, we propose Cache-Craft, a system for managing and reusing the precomputed KVs corresponding to text chunks (which we call chunk-caches) in RAG-based systems. We present how to identify chunk-caches that are reusable, how to efficiently perform a small fraction of recomputation to fix the cache and maintain output quality, and how to efficiently store and evict chunk-caches in hardware to maximize reuse while masking any overheads. With real production workloads as well as synthetic datasets, we show that Cache-Craft reduces redundant computation by 51% over SOTA prefix-caching and by 75% over full recomputation. Additionally, with continuous batching on a real production workload, we get a 1.6X speedup in throughput and a 2X reduction in end-to-end response latency over prefix-caching while maintaining quality, for both the LLaMA-3-8B and LLaMA-3-70B models.
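
To make the contrast between prefix-caching and chunk-level reuse concrete, the following is a minimal illustrative sketch and not the Cache-Craft implementation. It only counts how many prompt tokens have a previously computed cache under each policy. The function names, chunk identifiers, and token counts are all hypothetical, and the chunk-level count is an upper bound on reuse candidates: as the abstract notes, naive reuse degrades quality, so a real system must still recompute part of each reused cache.

```python
from typing import Dict, Sequence


def prefix_reusable_tokens(
    prompt: Sequence[str],
    history: Sequence[Sequence[str]],
    chunk_len: Dict[str, int],
) -> int:
    """Tokens covered by the longest chunk prefix shared with any earlier prompt."""
    best = 0
    for old in history:
        shared = 0
        for new_id, old_id in zip(prompt, old):
            if new_id != old_id:
                break  # prefix caching stops at the first mismatch
            shared += 1
        best = max(best, shared)
    return sum(chunk_len[c] for c in prompt[:best])


def chunk_reusable_tokens(
    prompt: Sequence[str],
    history: Sequence[Sequence[str]],
    chunk_len: Dict[str, int],
) -> int:
    """Tokens belonging to chunks seen before at any position (reuse candidates only)."""
    seen = {c for old in history for c in old}
    return sum(chunk_len[c] for c in prompt if c in seen)


if __name__ == "__main__":
    # Hypothetical chunk IDs mapped to hypothetical token counts.
    chunk_len = {"A": 100, "B": 200, "C": 150, "D": 50}
    history = [["A", "B", "C"]]   # chunk order of an earlier prompt
    prompt = ["B", "A", "D"]      # same chunks A and B, different order

    total = sum(chunk_len[c] for c in prompt)
    print("prefix-cache reuse:", prefix_reusable_tokens(prompt, history, chunk_len), "of", total)
    print("chunk-cache candidates:", chunk_reusable_tokens(prompt, history, chunk_len), "of", total)
```

In this toy workload the new prompt retrieves two previously seen chunks in a different order, so prefix matching finds nothing to reuse (0 of 350 tokens) while chunk-level matching finds 300 of 350 tokens as candidates. Deciding which of those candidates are actually safe to reuse, and how much to recompute, is the problem the paper addresses.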

Comments:	Accepted at SIGMOD 2025
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Operating Systems (cs.OS)
Cite as:	arXiv:2502.15734 [cs.DC]
	(or arXiv:2502.15734v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2502.15734

Submission history

From: Shubham Agarwal
[v1] Wed, 5 Feb 2025 14:12:33 UTC (6,416 KB)
