S$^3$Attention: Improving Long Sequence Attention with Smoothed Skeleton Sketching

Wang, Xue; Zhou, Tian; Zhu, Jianqing; Liu, Jialin; Yuan, Kun; Yao, Tao; Yin, Wotao; Jin, Rong; Cai, HanQin

doi:10.1109/JSTSP.2024.3446173

arXiv:2408.08567 (cs)

[Submitted on 16 Aug 2024 (v1), last revised 17 Sep 2024 (this version, v3)]

Abstract: Attention-based models have achieved remarkable breakthroughs in numerous applications. However, the quadratic complexity of Attention makes vanilla Attention-based models hard to apply to long-sequence tasks. Various improved Attention structures have been proposed to reduce the computation cost by inducing low-rankness and approximating the whole sequence by sub-sequences. The most challenging part of those approaches is maintaining the proper balance between information preservation and computation reduction: the longer the sub-sequences used, the better the information is preserved, but at the price of introducing more noise and computational cost. In this paper, we propose a smoothed skeleton sketching based Attention structure, coined S$^3$Attention, which significantly improves upon previous attempts to negotiate this trade-off. S$^3$Attention has two mechanisms to effectively minimize the impact of noise while keeping the complexity linear in the sequence length: a smoothing block that mixes information over long sequences and a matrix sketching method that simultaneously selects columns and rows from the input matrix. We verify the effectiveness of S$^3$Attention both theoretically and empirically. Extensive studies over the Long Range Arena (LRA) datasets and six time-series forecasting datasets show that S$^3$Attention significantly outperforms both vanilla Attention and other state-of-the-art variants of Attention structures.
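To make the two ingredients named in the abstract more concrete, the following is a minimal NumPy sketch of a generic skeleton (CUR-style) approximation, which rebuilds a matrix from a few of its rows and columns, together with a simple moving-average smoother along the sequence axis. It is an illustration under stated assumptions, not the authors' S$^3$Attention layer: the function names `smooth_rows` and `skeleton_approx`, the uniform random index selection, the moving-average kernel, and the `rcond` cutoff are all hypothetical choices made for this example.

```python
import numpy as np


def smooth_rows(x, window=3):
    """Moving average along the sequence axis (axis 0).

    Assumption: a plain box filter stands in for a smoothing block.
    """
    kernel = np.ones(window) / window
    return np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, x
    )


def skeleton_approx(a, row_idx, col_idx, rcond=1e-10):
    """Generic skeleton approximation a ~ C @ pinv(W) @ R.

    C holds the selected columns, R the selected rows, and W their
    intersection. Only those rows and columns of `a` are read.
    """
    c = a[:, col_idx]
    r = a[row_idx, :]
    w = a[np.ix_(row_idx, col_idx)]
    # rcond discards near-zero singular values of W before inverting.
    return c @ np.linalg.pinv(w, rcond=rcond) @ r


# Toy check on an exactly rank-3 matrix (synthetic data, not from the paper).
rng = np.random.default_rng(0)
a = rng.standard_normal((64, 3)) @ rng.standard_normal((3, 16))
rows = rng.choice(64, size=5, replace=False)
cols = rng.choice(16, size=5, replace=False)
approx = skeleton_approx(a, rows, cols)
rel_err = np.linalg.norm(approx - a) / np.linalg.norm(a)
print(f"relative error: {rel_err:.2e}")
```

For an exactly low-rank matrix, a handful of rows and columns recovers it up to floating-point error, provided their intersection has the same rank as the full matrix. On noisy, approximately low-rank inputs, the quality depends on which rows and columns are picked, which is the trade-off the abstract describes.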

Subjects:	Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Machine Learning (stat.ML)
Cite as:	arXiv:2408.08567 [cs.LG]
	(or arXiv:2408.08567v3 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2408.08567
Related DOI:	https://doi.org/10.1109/JSTSP.2024.3446173

Submission history

From: HanQin Cai
[v1] Fri, 16 Aug 2024 07:01:46 UTC (922 KB)
[v2] Fri, 23 Aug 2024 04:53:11 UTC (922 KB)
[v3] Tue, 17 Sep 2024 17:30:46 UTC (922 KB)
