DASS: Distilled Audio State Space Models Are Stronger and More Duration-Scalable Learners

Bhati, Saurabhchand; Gong, Yuan; Karlinsky, Leonid; Kuehne, Hilde; Feris, Rogerio; Glass, James

arXiv:2407.04082 (eess)

[Submitted on 4 Jul 2024]

Abstract: State-space models (SSMs) have emerged as an alternative to Transformers for audio modeling due to their high computational efficiency with long inputs. While recent efforts on Audio SSMs have reported encouraging results, two main limitations remain. First, in 10-second short audio tagging tasks, Audio SSMs still underperform Transformer-based models such as the Audio Spectrogram Transformer (AST). Second, although Audio SSMs theoretically support long audio inputs, their actual performance with long audio has not been thoroughly evaluated. To address these limitations, in this paper, 1) we applied knowledge distillation in audio state-space model training, resulting in a model called Knowledge Distilled Audio SSM (DASS). To the best of our knowledge, it is the first SSM that outperforms Transformers on AudioSet, achieving an mAP of 47.6; and 2) we designed a new test called Audio Needle In A Haystack (Audio NIAH). We find that DASS, trained with only 10-second audio clips, can retrieve sound events in audio recordings up to 2.5 hours long, while the AST model fails when the input is just 50 seconds, demonstrating that SSMs are indeed more duration scalable.
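
The abstract names knowledge distillation but does not state the training objective. As a rough illustration only, the sketch below shows one common form of distillation for multi-label audio tagging: a weighted sum of binary cross-entropy against the ground-truth labels and binary cross-entropy against a teacher's sigmoid outputs. The function names, the weight `lam`, and the choice of binary cross-entropy for both terms are assumptions made for this example, not details taken from the paper.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def bce(probs, targets, eps=1e-7):
    """Mean binary cross-entropy; targets may be hard (0/1) or soft (in [0, 1])."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(probs)


def distillation_loss(student_logits, teacher_logits, labels, lam=0.5):
    """Hypothetical objective: lam * BCE(student, labels) + (1 - lam) * BCE(student, teacher).

    `lam` is an illustrative weight, not a value reported in the abstract.
    """
    student_probs = [sigmoid(z) for z in student_logits]
    teacher_probs = [sigmoid(z) for z in teacher_logits]
    hard_term = bce(student_probs, labels)         # fit the ground-truth tags
    soft_term = bce(student_probs, teacher_probs)  # mimic the teacher's predictions
    return lam * hard_term + (1.0 - lam) * soft_term


# Toy usage with two classes: a student that agrees with the teacher and the
# labels receives a lower loss than one that contradicts both.
agree = distillation_loss([2.0, -2.0], [2.0, -2.0], [1, 0])
disagree = distillation_loss([-2.0, 2.0], [2.0, -2.0], [1, 0])
print(agree < disagree)
```

In a setup like the one the abstract describes, the teacher would be a Transformer-based tagger such as AST and the student an Audio SSM; the exact loss, weighting, and teacher configuration would need to be checked against the paper itself.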

Subjects:	Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2407.04082 [eess.AS]
	(or arXiv:2407.04082v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2407.04082

Submission history

From: Saurabhchand Bhati
[v1] Thu, 4 Jul 2024 17:46:19 UTC (826 KB)
