Seesaw: High-throughput LLM Inference via Model Re-sharding

Su, Qidong; Zhao, Wei; Li, Xin; Andoorveedu, Muralidhar; Jiang, Chenhao; Zhu, Zhanda; Song, Kevin; Giannoula, Christina; Pekhimenko, Gennady

arXiv:2503.06433 (cs)

[Submitted on 9 Mar 2025]

Abstract: To improve the efficiency of distributed large language model (LLM) inference, various parallelization strategies, such as tensor and pipeline parallelism, have been proposed. However, the distinct computational characteristics of the two stages of LLM inference (prefilling and decoding) render a single static parallelization strategy insufficient for the effective optimization of both stages. In this work, we present Seesaw, an LLM inference engine optimized for throughput-oriented tasks. The key idea behind Seesaw is dynamic model re-sharding, a technique that facilitates the dynamic reconfiguration of parallelization strategies across stages, thereby maximizing throughput in both stages. To mitigate re-sharding overhead and optimize computational efficiency, we employ tiered KV cache buffering and transition-minimizing scheduling. These approaches work synergistically to reduce the overhead caused by frequent stage transitions while ensuring maximum batching efficiency. Our evaluation demonstrates that Seesaw achieves a throughput increase of up to 1.78x (1.36x on average) compared to vLLM, the most widely used state-of-the-art LLM inference engine.
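
Illustrative sketch (not from the paper): the abstract's point that buffering reduces the overhead of frequent stage transitions can be seen in a toy model. The Python below is not Seesaw's implementation. The function name `count_reshards`, the `buffer_capacity` parameter, and the policy of switching stage only when the buffer is full or drained are all assumptions made for illustration.

```python
import math


def count_reshards(num_requests: int, buffer_capacity: int) -> int:
    """Count stage switches in a toy prefill/decode loop.

    Assumed policy (hypothetical, for illustration only): prefill requests
    until `buffer_capacity` KV caches are buffered or no requests remain,
    switch to decoding, drain the buffer, then switch back to prefilling
    if requests are still pending. Each switch counts as one re-shard.
    """
    if num_requests < 0 or buffer_capacity <= 0:
        raise ValueError("num_requests must be >= 0 and buffer_capacity > 0")

    reshards = 0
    remaining = num_requests
    while remaining > 0:
        # Prefill stage: fill the buffer with as many requests as fit.
        buffered = min(remaining, buffer_capacity)
        remaining -= buffered
        reshards += 1  # switch prefill -> decode
        # Decode stage: the buffered requests are decoded to completion here.
        if remaining > 0:
            reshards += 1  # switch decode -> prefill for the next round
    return reshards


if __name__ == "__main__":
    # Larger buffers mean fewer rounds, hence fewer stage transitions.
    for capacity in (1, 8, 64):
        rounds = math.ceil(64 / capacity)
        print(capacity, rounds, count_reshards(64, capacity))
```

In this toy model the number of transitions is `2 * ceil(num_requests / buffer_capacity) - 1` for any non-empty workload, so a larger buffer directly lowers the number of re-sharding events. Real transition costs, batching limits, and memory tiers are not modeled.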

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2503.06433 [cs.DC]
	(or arXiv:2503.06433v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2503.06433

Submission history

From: Qidong Su
[v1] Sun, 9 Mar 2025 04:14:06 UTC (711 KB)
