VoxBlink: A Large Scale Speaker Verification Dataset on Camera

Lin, Yuke; Qin, Xiaoyi; Zhao, Guoqing; Cheng, Ming; Jiang, Ning; Wu, Haiyang; Li, Ming

arXiv:2308.07056 (eess)

[Submitted on 14 Aug 2023 (v1), last revised 13 Dec 2023 (this version, v7)]

Abstract: In this paper, we introduce a large-scale, high-quality audio-visual speaker verification dataset named VoxBlink. We propose an innovative and robust automatic audio-visual data mining pipeline to curate this dataset, which contains 1.45M utterances from 38K speakers. Because the data is collected automatically, some noisy data is inevitable. We therefore also apply a multi-modal purification step to generate a cleaner version of VoxBlink, named VoxBlink-clean, comprising 18K identities and 1.02M utterances. In contrast to VoxCeleb, VoxBlink is sourced from short videos of ordinary users, and the scenarios it covers align better with real-life situations. To the best of our knowledge, VoxBlink is one of the largest publicly available speaker verification datasets. Using the VoxCeleb and VoxBlink-clean datasets together, we train speaker verification models with multiple backbone architectures and evaluate them comprehensively on the VoxCeleb test sets. Experimental results show a substantial performance improvement, ranging from 12% to 30% relative, across the backbone architectures when VoxBlink-clean is added to the training data. The details of the dataset can be found on this http URL
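
As a reading aid for the "12% to 30% relative" figure, the following minimal sketch shows how a relative improvement is typically computed for an error-rate metric where lower is better. It assumes the metric is something like equal error rate (EER); the abstract does not name the metric, and the function name and the numbers used here are hypothetical illustrations, not results from the paper.

```python
def relative_improvement(baseline_error: float, new_error: float) -> float:
    """Return the relative error reduction in percent (lower error is better)."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error * 100.0


# Hypothetical error rates in percent, chosen only to illustrate the arithmetic.
baseline = 1.00  # e.g. a model trained on one dataset alone
combined = 0.80  # e.g. the same model trained with extra data added
print(round(relative_improvement(baseline, combined), 1))
```

A relative figure is distinct from an absolute one: in the hypothetical case above the absolute reduction is 0.20 percentage points, while the relative reduction is about 20%.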

Comments:	Accepted by ICASSP 2024
Subjects:	Audio and Speech Processing (eess.AS); Multimedia (cs.MM); Sound (cs.SD)
Cite as:	arXiv:2308.07056 [eess.AS]
	(or arXiv:2308.07056v7 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2308.07056

Submission history

From: Yuke Lin
[v1] Mon, 14 Aug 2023 10:31:29 UTC (778 KB)
[v2] Wed, 16 Aug 2023 01:58:26 UTC (949 KB)
[v3] Sun, 20 Aug 2023 14:39:06 UTC (978 KB)
[v4] Wed, 23 Aug 2023 06:39:08 UTC (1,005 KB)
[v5] Fri, 8 Sep 2023 01:51:10 UTC (979 KB)
[v6] Fri, 15 Sep 2023 06:01:14 UTC (2,132 KB)
[v7] Wed, 13 Dec 2023 02:24:37 UTC (2,132 KB)
