DS-TDNN: Dual-stream Time-delay Neural Network with Global-aware Filter for Speaker Verification

Li, Yangfu; Gan, Jiapan; Lin, Xiaodan

arXiv:2303.11020 (cs)

[Submitted on 20 Mar 2023 (v1), last revised 1 Aug 2023 (this version, v3)]

Abstract: Conventional time-delay neural networks (TDNNs) struggle to handle long-range context; their ability to represent speaker information is therefore limited in long utterances. Existing solutions either depend on increasing model complexity or try to balance local features against global context to address this issue. To effectively leverage the long-term dependencies of audio signals while constraining model complexity, we introduce a novel module called the Global-aware Filter layer (GF layer), which employs a set of learnable transform-domain filters between a 1D discrete Fourier transform and its inverse transform to capture global context. Additionally, we develop a dynamic filtering strategy and a sparse regularization method to enhance the performance of the GF layer and prevent overfitting. Based on the GF layer, we present a dual-stream TDNN architecture called DS-TDNN for automatic speaker verification (ASV), which utilizes two unique branches to extract local and global features in parallel and employs an efficient strategy to fuse different-scale information. Experiments on the VoxCeleb and SITW databases demonstrate that the DS-TDNN achieves a relative improvement of 10% together with a relative decline of 20% in computational cost over the ECAPA-TDNN on the speaker verification task. This improvement becomes more evident as utterance duration grows. Furthermore, the DS-TDNN also outperforms popular deep residual models and attention-based systems on utterances of arbitrary length.
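
The core operation the abstract attributes to the GF layer (a learnable filter applied between a 1D discrete Fourier transform and its inverse) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `global_filter`, the array shapes, and the random filter values are assumptions made for illustration, and the dynamic filtering strategy and sparse regularization mentioned in the abstract are omitted. Because element-wise multiplication in the frequency domain corresponds to circular convolution over the whole time axis, each output frame can depend on every input frame.

```python
import numpy as np

def global_filter(x, weight):
    """Filter a feature map in the frequency domain (hypothetical helper).

    x:      real array of shape (channels, frames)
    weight: complex array of shape (channels, frames // 2 + 1);
            in a trained model these values would be learned
    """
    frames = x.shape[-1]
    spectrum = np.fft.rfft(x, axis=-1)                # 1D DFT over time
    filtered = spectrum * weight                      # element-wise filter
    return np.fft.irfft(filtered, n=frames, axis=-1)  # inverse DFT

rng = np.random.default_rng(0)
channels, frames = 4, 200                             # assumed sizes
x = rng.standard_normal((channels, frames))
bins = frames // 2 + 1

# An all-ones filter passes the signal through unchanged.
identity = np.ones((channels, bins), dtype=complex)
y_identity = global_filter(x, identity)

# A random complex filter stands in for learned parameters.
weight = rng.standard_normal((channels, bins)) \
    + 1j * rng.standard_normal((channels, bins))
y = global_filter(x, weight)
```

The filter has one complex value per channel and frequency bin, so in this sketch its size is tied to the number of frames; a system that handles utterances of arbitrary length would need some way to adapt the filter to the input length, which is not shown here.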

Comments:	13 pages, 4 figures
Subjects:	Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
MSC classes:	68
ACM classes:	I.2.1
Cite as:	arXiv:2303.11020 [cs.SD]
	(or arXiv:2303.11020v3 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2303.11020

Submission history

From: Yangfu Li
[v1] Mon, 20 Mar 2023 10:58:12 UTC (4,856 KB)
[v2] Tue, 18 Apr 2023 04:32:23 UTC (5,711 KB)
[v3] Tue, 1 Aug 2023 07:09:50 UTC (7,617 KB)
