Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference

Recasens, Pol G.; Agullo, Ferran; Zhu, Yue; Wang, Chen; Lee, Eun Kyung; Tardieu, Olivier; Torres, Jordi; Berral, Josep Ll.

arXiv:2503.08311 (cs)

[Submitted on 11 Mar 2025]

Abstract: Large language models have been widely adopted across different tasks, but their autoregressive generation often leads to inefficient resource utilization during inference. While batching is commonly used to increase throughput, performance gains plateau beyond a certain batch size, especially with smaller models, a phenomenon that existing literature typically explains as a shift to the compute-bound regime. In this paper, through an in-depth GPU-level analysis, we reveal that large-batch inference remains memory-bound: most GPU compute capability sits underutilized because DRAM bandwidth saturation is the primary bottleneck. To address this, we propose a Batching Configuration Advisor (BCA) that optimizes memory allocation, reducing GPU memory requirements with minimal impact on throughput. The freed memory and underutilized GPU compute can then be leveraged by concurrent workloads. Specifically, we use model replication to improve serving throughput and GPU utilization. Our findings challenge conventional assumptions about LLM inference, offering new insights and practical strategies for improving resource utilization, particularly for smaller language models.
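
A roofline-style estimate gives some intuition for why decoding could stay memory-bound as the batch grows. The sketch below is not taken from the paper. It is a back-of-the-envelope model that assumes fp16 weights and KV cache, standard multi-head attention with no KV sharing across requests, and no on-chip cache reuse. The function names and the hardware figures in the usage lines are hypothetical and do not describe any specific GPU.

```python
def linear_intensity(batch, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte for a (batch x d_in) @ (d_in x d_out) matmul in one decode step."""
    flops = 2 * batch * d_in * d_out
    # Weights are read once and shared by the whole batch; activations scale with batch.
    bytes_moved = bytes_per_elem * (d_in * d_out + batch * d_in + batch * d_out)
    return flops / bytes_moved


def attention_decode_intensity(batch, context_len, d_head, n_heads, bytes_per_elem=2):
    """FLOPs per byte for decode attention where every request reads its own KV cache."""
    # QK^T and PV each cost 2 * context_len * d_head FLOPs per head per request.
    flops = batch * n_heads * 4 * context_len * d_head
    # K and V are read once per request and are NOT shared across the batch (assumption).
    kv_bytes = batch * n_heads * 2 * context_len * d_head * bytes_per_elem
    # One query vector read and one output vector written per head per request.
    qo_bytes = batch * n_heads * 2 * d_head * bytes_per_elem
    return flops / (kv_bytes + qo_bytes)


def is_memory_bound(intensity, peak_flops, mem_bandwidth):
    """Roofline test: memory-bound when intensity is below the ridge point."""
    return intensity < peak_flops / mem_bandwidth


# Illustrative, hypothetical hardware: 100 TFLOP/s peak, 1 TB/s DRAM bandwidth.
PEAK_FLOPS, MEM_BW = 100e12, 1e12

for b in (1, 16, 256):
    lin = linear_intensity(b, 4096, 4096)
    att = attention_decode_intensity(b, 1024, 128, 32)
    print(f"batch={b:4d}  linear={lin:8.2f}  attention={att:5.2f}  "
          f"attention memory-bound={is_memory_bound(att, PEAK_FLOPS, MEM_BW)}")
```

Under these assumptions the matmul intensity grows with batch size because the weights are amortized over more requests, while the attention intensity does not depend on batch size at all: both its FLOPs and its bytes scale linearly with the number of requests. This is only one possible reading of the memory-bound behavior the abstract describes. The paper's own GPU-level analysis is the authoritative source.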

Comments:	Pol G. Recasens, Ferran Agullo: equal contribution
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Cite as:	arXiv:2503.08311 [cs.DC]
	(or arXiv:2503.08311v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2503.08311

Submission history

From: Pol Garcia Recasens
[v1] Tue, 11 Mar 2025 11:21:35 UTC (1,376 KB)
