Jenga: Effective Memory Management for Serving LLM with Heterogeneity

Zhang, Chen; Du, Kuntai; Liu, Shu; Kwon, Woosuk; Mo, Xiangxi; Wang, Yufeng; Liu, Xiaoxuan; You, Kaichao; Li, Zhuohan; Long, Mingsheng; Zhai, Jidong; Gonzalez, Joseph; Stoica, Ion

arXiv:2503.18292 (cs)

[Submitted on 24 Mar 2025]

Abstract: Large language models (LLMs) are widely used but expensive to run, especially as inference workloads grow. To lower costs, maximizing the request batch size by managing GPU memory efficiently is crucial. While PagedAttention has recently been proposed to improve the efficiency of memory management, we find that the growing heterogeneity in the embedding dimensions, attention mechanisms, and access patterns of modern LLM architectures introduces new challenges for memory allocation.
In this paper, we present Jenga, a novel memory allocation framework for heterogeneous embeddings in LLMs. Jenga tackles two key challenges: (1) minimizing memory fragmentation when managing embeddings of different sizes, and (2) enabling flexible caching and eviction policies tailored to the specific token-dependency patterns of various layers. Jenga employs a two-level memory allocator, leveraging the least common multiple (LCM) of embedding sizes to optimize memory usage and providing APIs to express layer-specific caching logic to enhance memory reuse.
We implement Jenga on vLLM, a state-of-the-art LLM inference engine, and evaluate it with diverse LLMs, datasets, and GPU configurations. Evaluations show that Jenga improves GPU memory utilization by up to 79.6%, and increases serving throughput by up to 4.92x (1.80x on average).
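The LCM-based two-level allocation described in the abstract can be illustrated with a minimal sketch. This is not Jenga's or vLLM's actual code: the class name `TwoLevelAllocator`, the layer-type labels, and the byte sizes are hypothetical, chosen only to show why sizing large pages to the least common multiple lets any layer type carve a large page into its own small pages with no leftover space.

```python
from math import lcm


class TwoLevelAllocator:
    """Illustrative sketch only (hypothetical names, not Jenga's API).

    Level 1: memory is split into large pages whose size is the LCM of all
    small-page sizes, so a large page divides evenly for every layer type.
    Level 2: each large page is carved into small pages of a single type.
    """

    def __init__(self, total_bytes, sizes):
        self.sizes = dict(sizes)                      # layer type -> small-page bytes
        self.large_page = lcm(*self.sizes.values())   # bytes per large page
        self.free_large = list(range(total_bytes // self.large_page))
        self.free_small = {kind: [] for kind in self.sizes}  # (large_id, slot)
        self.used = {}                                # large_id -> slots in use

    def alloc(self, kind):
        """Return the byte offset of a small page for the given layer type."""
        if not self.free_small[kind]:
            if not self.free_large:
                raise MemoryError("no free large page")
            large_id = self.free_large.pop()
            self.used[large_id] = 0
            slots = self.large_page // self.sizes[kind]  # exact, thanks to the LCM
            self.free_small[kind].extend((large_id, s) for s in range(slots))
        large_id, slot = self.free_small[kind].pop()
        self.used[large_id] += 1
        return large_id * self.large_page + slot * self.sizes[kind]

    def free(self, kind, offset):
        """Release a small page; return its large page once it is fully empty."""
        large_id, rem = divmod(offset, self.large_page)
        self.free_small[kind].append((large_id, rem // self.sizes[kind]))
        self.used[large_id] -= 1
        if self.used[large_id] == 0:
            self.free_small[kind] = [
                p for p in self.free_small[kind] if p[0] != large_id
            ]
            del self.used[large_id]
            self.free_large.append(large_id)


# Hypothetical sizes: 768-byte and 512-byte small pages share 1536-byte large pages.
allocator = TwoLevelAllocator(2 * 1536, {"full": 768, "window": 512})
```

With these assumed sizes, one large page holds exactly two 768-byte pages or exactly three 512-byte pages, and a large page emptied by one layer type can be reused by the other.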

Comments:	16 pages, 19 figures
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2503.18292 [cs.DC]
	(or arXiv:2503.18292v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2503.18292

Submission history

From: Chen Zhang
[v1] Mon, 24 Mar 2025 02:28:04 UTC (5,817 KB)
