$C^2$AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction

Wu, Wenxuan; Chen, Xueyuan; Wang, Shuai; Wang, Jiadong; Meng, Lingwei; Wu, Xixin; Meng, Helen; Li, Haizhou

arXiv:2504.00750 (cs)

[Submitted on 1 Apr 2025]

Abstract: Audio-Visual Target Speaker Extraction (AV-TSE) aims to mimic the human ability to enhance auditory perception using visual cues. Although numerous models have been proposed recently, most of them estimate target signals by relying primarily on local dependencies within acoustic features, underutilizing the human-like capacity to infer unclear parts of speech from contextual information. This limitation results not only in suboptimal performance but also in inconsistent extraction quality across the utterance, with some segments exhibiting poor quality or inadequate suppression of interfering speakers. To close this gap, we propose a model-agnostic strategy called Mask-And-Recover (MAR). It integrates both inter- and intra-modality contextual correlations to enable global inference within extraction modules. Additionally, to better target challenging parts within each sample, we introduce a Fine-grained Confidence Score (FCS) model that assesses extraction quality and guides extraction modules to emphasize improvement on low-quality segments. To validate the effectiveness of the proposed model-agnostic training paradigm, six popular AV-TSE backbones were adopted for evaluation on the VoxCeleb2 dataset, demonstrating consistent performance improvements across various metrics.
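The abstract names two ideas without giving their details: masking parts of the input so they must be recovered from context, and using a per-segment confidence score to put more weight on poorly extracted segments. The following is a minimal illustrative sketch of those two ideas only, not the authors' implementation. The function names `mask_spans` and `confidence_weighted_loss`, the span-based masking scheme, the `mask_ratio` and `span` parameters, and the `1 - confidence` weighting are all assumptions made for illustration; the paper itself may define these quite differently.

```python
import random


def mask_spans(frames, mask_ratio=0.3, span=4, seed=0):
    """Zero out random contiguous spans of frames (hypothetical masking scheme).

    Returns a masked copy of the frames and a boolean list marking masked frames.
    """
    rng = random.Random(seed)
    n = len(frames)
    masked = [False] * n
    target = int(n * mask_ratio)
    attempts = 0
    # Keep adding spans until at least `target` frames are masked.
    while sum(masked) < target and attempts < 10 * n:
        start = rng.randrange(0, max(1, n - span + 1))
        for i in range(start, min(n, start + span)):
            masked[i] = True
        attempts += 1
    out = [[0.0] * len(f) if m else list(f) for f, m in zip(frames, masked)]
    return out, masked


def confidence_weighted_loss(segment_losses, confidences):
    """Average per-segment losses, weighting each by (1 - confidence).

    Assumed weighting: low-confidence segments contribute more to the loss.
    Falls back to a plain mean when every segment has confidence 1.
    """
    weights = [1.0 - c for c in confidences]
    total = sum(weights)
    if total == 0:
        return sum(segment_losses) / len(segment_losses)
    return sum(w * l for w, l in zip(weights, segment_losses)) / total


if __name__ == "__main__":
    toy_frames = [[1.0, 2.0] for _ in range(20)]  # toy feature sequence
    masked_frames, mask = mask_spans(toy_frames)
    print(sum(mask), "of", len(mask), "frames masked")
    print(confidence_weighted_loss([1.0, 3.0], [0.9, 0.2]))
```

In a training loop under these assumptions, an extraction model would receive the masked features, and its per-segment reconstruction losses would be combined with `confidence_weighted_loss` using scores from a separate quality estimator.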

Comments:	Accepted by IEEE Journal of Selected Topics in Signal Processing (JSTSP)
Subjects:	Sound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2504.00750 [cs.SD]
	(or arXiv:2504.00750v1 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2504.00750

Submission history

From: Wenxuan Wu
[v1] Tue, 1 Apr 2025 13:01:30 UTC (24,747 KB)
