Multi-Iteration Multi-Stage Fine-Tuning of Transformers for Sound Event Detection with Heterogeneous Datasets

Schmid, Florian; Primus, Paul; Morocutti, Tobias; Greif, Jonathan; Widmer, Gerhard

arXiv:2407.12997 (eess)

[Submitted on 17 Jul 2024]

Abstract: A central problem in building effective sound event detection systems is the lack of high-quality, strongly annotated sound event datasets. For this reason, Task 4 of the DCASE 2024 challenge proposes learning from two heterogeneous datasets, including audio clips labeled with varying annotation granularity and with different sets of possible events. We propose a multi-iteration, multi-stage procedure for fine-tuning Audio Spectrogram Transformers on the joint DESED and MAESTRO Real datasets. The first stage closely matches the baseline system setup and trains a CRNN model while keeping the pre-trained transformer model frozen. In the second stage, both CRNN and transformer are fine-tuned using heavily weighted self-supervised losses. After the second stage, we compute strong pseudo-labels for all audio clips in the training set using an ensemble of fine-tuned transformers. Then, in a second iteration, we repeat the two-stage training process and include a distillation loss based on the pseudo-labels, achieving a new single-model, state-of-the-art performance on the public evaluation set of DESED with a PSDS1 of 0.692. A single model and an ensemble, both based on our proposed training procedure, ranked first in Task 4 of the DCASE Challenge 2024.
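
The pseudo-labeling and distillation steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state how ensemble outputs are combined or which distillation loss is used, so the sketch assumes simple averaging of frame-level probabilities and a binary cross-entropy distillation term. The function names and the `distill_weight` parameter are hypothetical, and the supervised term is simplified to strong labels only.

```python
import math


def ensemble_pseudo_labels(model_probs):
    """Average frame-level probabilities over ensemble members.

    model_probs: one entry per model, each a [frames][classes] grid of
    probabilities. Averaging is an assumption, not taken from the paper.
    """
    n_models = len(model_probs)
    n_frames = len(model_probs[0])
    n_classes = len(model_probs[0][0])
    return [
        [sum(m[t][c] for m in model_probs) / n_models for c in range(n_classes)]
        for t in range(n_frames)
    ]


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between two [frames][classes] grids."""
    total, count = 0.0, 0
    for p_row, y_row in zip(pred, target):
        for p, y in zip(p_row, y_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
            count += 1
    return total / count


def iteration2_loss(student, strong_labels, pseudo_labels, distill_weight=1.0):
    """Supervised loss plus a weighted distillation term.

    The BCE form and the weighting scheme are assumptions for illustration.
    """
    supervised = bce(student, strong_labels)
    distillation = bce(student, pseudo_labels)
    return supervised + distill_weight * distillation
```

In this sketch, the soft targets returned by `ensemble_pseudo_labels` would be computed once after the first iteration and then reused as fixed targets in both stages of the second iteration.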

Comments:	Code: this https URL
Subjects:	Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2407.12997 [eess.AS]
	(or arXiv:2407.12997v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2407.12997

Submission history

From: Florian Schmid
[v1] Wed, 17 Jul 2024 20:32:58 UTC (1,602 KB)
