Attention in SRAM on Tenstorrent Grayskull

Thüning, Moritz


arXiv:2407.13885 (cs)

[Submitted on 18 Jul 2024]

Abstract: Implementations of the Transformer's self-attention layer can achieve significant speedups when they use SRAM instead of DRAM. The Tenstorrent Grayskull architecture provides a large SRAM, distributed across a grid of cores. This work presents a fused kernel for Grayskull that works exclusively in this SRAM by combining the matrix multiplication, attention score scaling, and Softmax operations. Additionally, a dedicated Softmax kernel that uses the SRAM and a CPU implementation that serves as a baseline are presented. On Grayskull, the Softmax operation consumes most of the runtime when computing attention weights from queries and keys. The dedicated Softmax kernel is up to $10 \times$ faster than the CPU implementation, and the Softmax implementation inside the fused kernel is approximately $1.8 \times$ faster than the dedicated Softmax kernel. The time and memory complexities of all implementations are quadratic in sequence length. At the time of submission, the Grayskull e150 is approximately $30 \times$ cheaper for the general public than an Nvidia H100 PCIe (a state-of-the-art GPU) and offers approximately $1.5 \times$ more SRAM.
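
The attention-weight computation described in the abstract (matrix multiplication of queries and keys, score scaling, then Softmax) can be illustrated with a minimal pure-Python sketch. This is not the author's Grayskull kernel: the function names are hypothetical, the scaling factor $1/\sqrt{d}$ is assumed from standard scaled dot-product attention, and "fusion" is only mimicked by processing one row at a time instead of materializing intermediate matrices. In both variants the result has one weight per query-key pair, which is where the quadratic memory cost in sequence length comes from.

```python
import math


def softmax(row):
    """Numerically stable Softmax over one row of scores."""
    m = max(row)  # subtract the row maximum to avoid overflow in exp
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    """Unfused reference: build all scores, then scale, then apply Softmax."""
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)  # assumed standard scaling, not taken from the paper
    scores = [[sum(qi * ki for qi, ki in zip(q, k)) for k in keys] for q in queries]
    scaled = [[s * scale for s in row] for row in scores]
    return [softmax(row) for row in scaled]


def attention_weights_fused(queries, keys):
    """Row-wise variant: dot products, scaling, and Softmax in a single pass per row."""
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    weights = []
    for q in queries:
        row = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights.append(softmax(row))  # no separate score or scaled-score matrix is kept
    return weights
```

Both functions return an n-by-n list of rows that each sum to 1; the row-wise version simply avoids holding the unscaled and scaled score matrices at the same time.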

Comments:	8 pages, 6 figures, code: this https URL
Subjects:	Machine Learning (cs.LG); Performance (cs.PF)
ACM classes:	I.2.6
Cite as:	arXiv:2407.13885 [cs.LG]
	(or arXiv:2407.13885v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2407.13885

Submission history

From: Moritz Thüning
[v1] Thu, 18 Jul 2024 20:19:36 UTC (89 KB)
