VoiceVector: Multimodal Enrolment Vectors for Speaker Separation

Rahimi, Akam; Afouras, Triantafyllos; Zisserman, Andrew

arXiv:2501.01401 (eess)

[Submitted on 2 Jan 2025]

Abstract: We present a transformer-based architecture for separating the voice of a target speaker from multiple other speakers and ambient noise. We achieve this with two separate neural networks: (A) an enrolment network that produces speaker-specific embeddings from various combinations of audio and visual modalities; and (B) a separation network that takes both the noisy signal and the enrolment vectors as inputs and outputs the clean signal of the target speaker. The novelties are: (i) the enrolment vector can be produced from audio only, from audio-visual data (using lip movements), or from visual data alone (using lip movements from silent video); and (ii) the separation can be flexibly conditioned on multiple positive and negative enrolment vectors. We compare with previous methods and obtain superior performance.
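
The data flow described in the abstract (enrolment vectors in, conditioned separation out) can be illustrated with a toy sketch. This is not the authors' implementation: the names `pool_enrolments` and `separate`, the mean-pooling rule, the zero vector for an empty group, and the single linear masking layer are all assumptions made for illustration. The paper uses transformer networks, which are replaced here by a trivial stand-in so the example stays short and runnable.

```python
import numpy as np


def pool_enrolments(positives, negatives, dim):
    """Average each group of enrolment vectors and concatenate the results.

    Hypothetical pooling rule (an assumption, not from the paper):
    an empty group is represented by a zero vector.
    """
    pos = np.mean(positives, axis=0) if len(positives) > 0 else np.zeros(dim)
    neg = np.mean(negatives, axis=0) if len(negatives) > 0 else np.zeros(dim)
    return np.concatenate([pos, neg])  # shape (2 * dim,)


def separate(mixture, conditioning, weights):
    """Toy stand-in for a separation network: one linear layer plus a mask.

    mixture:      (T, F) non-negative magnitude features of the noisy signal
    conditioning: (C,) pooled enrolment vector
    weights:      (F + C, F) illustrative parameters (random here, not trained)
    """
    tiled = np.tile(conditioning, (mixture.shape[0], 1))  # (T, C)
    inputs = np.concatenate([mixture, tiled], axis=1)     # (T, F + C)
    mask = 1.0 / (1.0 + np.exp(-(inputs @ weights)))      # sigmoid mask
    return mask * mixture                                 # masked estimate, (T, F)


rng = np.random.default_rng(0)
T, F, D = 50, 16, 8                       # frames, feature bins, embedding size
mixture = np.abs(rng.standard_normal((T, F)))
positives = rng.standard_normal((2, D))   # two enrolments of the target speaker
negatives = rng.standard_normal((1, D))   # one enrolment of an interfering speaker
weights = 0.1 * rng.standard_normal((F + 2 * D, F))

conditioning = pool_enrolments(positives, negatives, D)
estimate = separate(mixture, conditioning, weights)
print(estimate.shape)  # (50, 16)
```

In this sketch the number of positive and negative enrolment vectors can vary freely because they are pooled before conditioning; how the actual model combines them is not specified in the abstract.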

Subjects:	Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2501.01401 [eess.AS]
	(or arXiv:2501.01401v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2501.01401
Journal reference:	2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)

Submission history

From: Akam Rahimi
[v1] Thu, 2 Jan 2025 18:25:27 UTC (658 KB)
