Token Pruning in Audio Transformers: Optimizing Performance and Decoding Patch Importance

Lee, Taehan; Lee, Hyukjun

arXiv:2504.01690 (cs)

[Submitted on 2 Apr 2025]

Abstract: Vision Transformers (ViTs) have achieved state-of-the-art performance across various computer vision tasks, but their high computational cost remains a challenge. Token pruning has been proposed to reduce this cost by selectively removing less important tokens. While effective in vision tasks by discarding non-object regions, applying this technique to audio tasks presents unique challenges, as distinguishing relevant from irrelevant regions in time-frequency representations is less straightforward. In this study, for the first time, we applied token pruning to ViT-based audio classification models using Mel-spectrograms and analyzed the trade-offs between model performance and computational cost: TopK token pruning can reduce MAC operations of AudioMAE and AST by 30-40%, with less than a 1% drop in classification accuracy. Our analysis reveals that while high-intensity tokens contribute significantly to model accuracy, low-intensity tokens remain important. In particular, they play a more critical role in general audio classification tasks than in speech-specific tasks.
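As an illustration of the TopK idea mentioned in the abstract, the following is a minimal, framework-free sketch and not the authors' implementation. The abstract does not say how tokens are scored, so the `scores` argument, the assumption that the first token is a class token that is always kept, and the names `topk_prune`, `keep_ratio`, and `patch_intensity` are all hypothetical choices made for this example. A real model would do the same selection with batched tensor operations inside the transformer.

```python
import math


def topk_prune(tokens, scores, keep_ratio):
    """Keep the highest-scoring patch tokens.

    Assumptions (not from the paper): tokens[0] is a class token that is
    always retained, and `scores` holds one importance value per patch
    token, i.e. len(scores) == len(tokens) - 1.
    """
    if len(scores) != len(tokens) - 1:
        raise ValueError("expected one score per patch token")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    # Number of patch tokens that survive pruning (at least one).
    k = max(1, math.ceil(keep_ratio * len(scores)))

    # Indices of the k largest scores, then put back in their original
    # order so surviving tokens keep their time-frequency ordering.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    kept = sorted(top)

    pruned = [tokens[0]] + [tokens[i + 1] for i in kept]
    return pruned, kept


def patch_intensity(patch):
    """Mean value of a 2-D spectrogram patch (hypothetical intensity proxy)."""
    values = [v for row in patch for v in row]
    return sum(values) / len(values)


# Usage: one class token followed by four patch tokens, keeping half.
tokens = [[0.0], [1.0], [2.0], [3.0], [4.0]]
scores = [0.1, 0.9, 0.3, 0.7]
pruned, kept = topk_prune(tokens, scores, keep_ratio=0.5)
print(kept)    # [1, 3]
print(pruned)  # [[0.0], [2.0], [4.0]]
```

Because later layers then process fewer tokens, compute drops with the number of tokens removed. The 30-40% MAC reduction quoted in the abstract is the paper's measured result and is not something this sketch reproduces.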

Comments:	This work has been submitted to the IEEE for possible publication. Source code is available at this https URL
Subjects:	Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2504.01690 [cs.SD]
	(or arXiv:2504.01690v1 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2504.01690

Submission history

From: Taehan Lee
[v1] Wed, 2 Apr 2025 12:44:38 UTC (1,737 KB)
