A Tensor-Train Decomposition based Compression of LLMs on Group Vector Systolic Accelerator

Huang, Sixiao; Wang, Tintin; Li, Ang; Shen, Ao; Li, Kai; Jiang, Keyao; Huang, Mingqiang; Yu, Hao

arXiv:2501.19135 (cs)

[Submitted on 31 Jan 2025]

Abstract:Large language models (LLMs) are both storage-intensive and computation-intensive, which poses significant challenges for deployment on resource-constrained hardware. Because linear layers account for most of the resource consumption in LLMs, this paper develops a tensor-train decomposition (TTD) for LLMs together with a hardware implementation on FPGA. TTD compression is applied to the linear layers of the ChatGLM3-6B and LLaMA2-7B models, giving whole-network compression ratios (CRs) of 1.94$\times$ and 1.60$\times$, respectively. The compressed LLMs are then implemented on FPGA hardware using a highly efficient group vector systolic array (GVSA) architecture, which has DSP-shared parallel vector PEs for TTD inference as well as optimized data communication within the accelerator. Experimental results show that the resulting TTD-based LLM accelerator on FPGA reduces first-token delay by 1.45$\times$ and 1.57$\times$ for ChatGLM3-6B and LLaMA2-7B, respectively.
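
To make the idea of TTD compression of a linear layer concrete, the following is a minimal NumPy sketch of the generic TT-SVD procedure applied to a single weight matrix. It is not the authors' implementation: the function names, the toy 64 x 64 weight, the factorization (4, 4, 4) x (4, 4, 4), and the rank cap of 8 are all assumptions chosen for illustration, and the per-matrix ratio it prints is unrelated to the whole-network CRs reported in the abstract.

```python
import numpy as np

def tt_svd(tensor, max_rank):
    """Split a d-way tensor into TT cores of shape (r_prev, dim_k, r_next)
    by a sweep of truncated SVDs (generic TT-SVD, ranks capped at max_rank)."""
    dims = tensor.shape
    cores, r_prev = [], 1
    mat = tensor.reshape(r_prev * dims[0], -1)
    for k in range(len(dims) - 1):
        u, s, vt = np.linalg.svd(mat, full_matrices=False)
        r = min(max_rank, len(s))                      # truncate the TT rank
        cores.append(u[:, :r].reshape(r_prev, dims[k], r))
        mat = (s[:r, None] * vt[:r]).reshape(r * dims[k + 1], -1)
        r_prev = r
    cores.append(mat.reshape(r_prev, dims[-1], 1))
    return cores

def interleave_perm(d):
    """Axis order (m1, n1, m2, n2, ...) for a tensor shaped (m1..md, n1..nd)."""
    return [axis for k in range(d) for axis in (k, d + k)]

def matrix_to_tt(weight, row_factors, col_factors, max_rank):
    """Hypothetical helper: reshape an (M, N) weight into a d-way tensor whose
    k-th mode has size m_k * n_k, then run TT-SVD on it."""
    d = len(row_factors)
    t = weight.reshape(*row_factors, *col_factors)
    t = t.transpose(interleave_perm(d))
    t = t.reshape([m * n for m, n in zip(row_factors, col_factors)])
    return tt_svd(t, max_rank)

def tt_to_matrix(cores, row_factors, col_factors):
    """Contract the cores and undo the reshaping to recover an (M, N) matrix."""
    d = len(row_factors)
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([-1], [0]))
    out = out.squeeze(axis=(0, -1))
    out = out.reshape([x for pair in zip(row_factors, col_factors) for x in pair])
    out = out.transpose(np.argsort(interleave_perm(d)))
    return out.reshape(int(np.prod(row_factors)), int(np.prod(col_factors)))

def compression_ratio(weight, cores):
    """Parameters in the dense matrix divided by parameters in the TT cores."""
    return weight.size / sum(core.size for core in cores)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((64, 64))                  # toy stand-in for a layer
    cores = matrix_to_tt(w, (4, 4, 4), (4, 4, 4), max_rank=8)
    approx = tt_to_matrix(cores, (4, 4, 4), (4, 4, 4))
    print("core shapes:", [c.shape for c in cores])
    print("compression ratio:", compression_ratio(w, cores))
    print("relative error:", np.linalg.norm(w - approx) / np.linalg.norm(w))
```

A random matrix has no low-rank structure, so the truncation error in this toy run is large; trained weights and the rank selection used in the paper may behave quite differently. With the rank cap removed, the same routines reconstruct the matrix exactly, which is a convenient sanity check.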

Subjects:	Hardware Architecture (cs.AR)
Cite as:	arXiv:2501.19135 [cs.AR]
	(or arXiv:2501.19135v1 [cs.AR] for this version)
	https://doi.org/10.48550/arXiv.2501.19135

Submission history

From: Sixiao Huang
[v1] Fri, 31 Jan 2025 13:45:31 UTC (225 KB)
