DeFT: Decoding with Flash Tree-attention for Efficient Tree-structured LLM Inference

Yao, Jinwei; Chen, Kaiqi; Zhang, Kexun; You, Jiaxuan; Yuan, Binhang; Wang, Zeke; Lin, Tao

arXiv:2404.00242 (cs)

[Submitted on 30 Mar 2024 (v1), last revised 7 Mar 2025 (this version, v4)]

Abstract: Large language models (LLMs) are increasingly employed for complex tasks that process multiple generation calls in a tree structure with shared prefixes of tokens, including few-shot prompting, multi-step reasoning, and speculative decoding. However, existing inference systems for tree-based applications are inefficient due to improper partitioning of queries and KV cache during attention calculation. This leads to two main issues: (1) a lack of memory access (IO) reuse for KV cache of shared prefixes, and (2) poor load balancing. As a result, there is redundant KV cache IO between GPU global memory and shared memory, along with low GPU utilization. To address these challenges, we propose DeFT (Decoding with Flash Tree-Attention), a hardware-efficient attention algorithm with prefix-aware and load-balanced KV cache partitions. DeFT reduces the number of read/write operations of KV cache during attention calculation through KV-Guided Grouping, a method that avoids repeatedly loading KV cache of shared prefixes in attention computation. Additionally, we propose Flattened Tree KV Splitting, a mechanism that ensures even distribution of the KV cache across partitions with little computation redundancy, enhancing GPU utilization during attention computations. By reducing KV cache IO by 73-99% and IO for partial results by nearly 100% during attention calculation, DeFT achieves up to 2.23/3.59x speedup in the end-to-end/attention latency across three practical tree-based workloads compared to state-of-the-art attention algorithms. Our code is available at this https URL.
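
The IO argument in the abstract can be illustrated with a toy counting model. The sketch below is not the DeFT kernel and is not taken from the authors' code; the `Node` class and both function names are hypothetical, introduced only for illustration. It assumes each tree node holds some number of KV-cached tokens, and it compares the tokens loaded when every root-to-leaf sequence reloads its full prefix against the tokens loaded when each node's KV cache is read once and shared by all queries beneath it. It ignores query IO, partial results, and load balancing.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node: `tokens` KV-cached tokens plus child branches."""
    tokens: int
    children: list[Node] = field(default_factory=list)


def per_query_kv_loads(node: Node, prefix: int = 0) -> int:
    """Tokens loaded when each root-to-leaf sequence reloads its whole prefix."""
    total = prefix + node.tokens
    if not node.children:
        return total  # one full path load per leaf
    return sum(per_query_kv_loads(child, total) for child in node.children)


def prefix_shared_kv_loads(node: Node) -> int:
    """Tokens loaded when each node's KV cache is read once and shared."""
    return node.tokens + sum(prefix_shared_kv_loads(c) for c in node.children)


# Assumed example: a 1000-token shared prompt with four 10-token branches.
tree = Node(1000, [Node(10), Node(10), Node(10), Node(10)])
naive = per_query_kv_loads(tree)        # 4 paths * 1010 tokens
shared = prefix_shared_kv_loads(tree)   # 1000 + 4 * 10 tokens
reduction = 1 - shared / naive
print(f"per-query: {naive}, prefix-shared: {shared}, saved: {reduction:.1%}")
```

In this toy model the saving grows with the length of the shared prefix and the number of branches. The percentages it produces are properties of the assumed tree, not a reproduction of the measured 73-99% reported in the abstract.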

Comments:	Update DeFT-v4, accepted by ICLR'25 (this https URL). Our code is available at this https URL
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2404.00242 [cs.CL]
	(or arXiv:2404.00242v4 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2404.00242

Submission history

From: Jinwei Yao [view email]
[v1] Sat, 30 Mar 2024 04:34:54 UTC (5,798 KB)
[v2] Wed, 29 May 2024 18:46:41 UTC (2,586 KB)
[v3] Thu, 3 Oct 2024 22:17:01 UTC (2,467 KB)
[v4] Fri, 7 Mar 2025 17:47:42 UTC (2,556 KB)
