Unsupervised active speaker detection in media content using cross-modal information

Sharma, Rahul; Narayanan, Shrikanth

arXiv:2209.11896 (eess)

[Submitted on 24 Sep 2022]

Abstract: We present a cross-modal unsupervised framework for active speaker detection in media content such as TV shows and movies. Machine learning advances have enabled impressive performance in identifying individuals from speech and facial images. We leverage speaker identity information from speech and faces, and formulate active speaker detection as a speech-face assignment task such that the active speaker's face and the underlying speech identify the same person (character). We express each speech segment in terms of its speaker identity distances from all other speech segments, to capture a relative identity structure for the video. Then we assign an active speaker's face to each speech segment from the concurrently appearing faces such that the obtained set of active speaker faces displays a similar relative identity structure. Furthermore, we propose a simple and effective approach to address speech segments where speakers are present off-screen. We evaluate the proposed system on three benchmark datasets -- Visual Person Clustering dataset, AVA-active speaker dataset, and Columbia dataset -- consisting of videos from entertainment and broadcast media, and show performance competitive with state-of-the-art fully supervised methods.
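The speech-face assignment idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state the optimization procedure, so the greedy coordinate-wise search, the squared-error agreement score, the cosine distance, and the names `assign_faces`, `speech_emb`, and `face_candidates` are all assumptions made for illustration. The sketch also assumes precomputed speaker and face identity embeddings, and it does not cover the off-screen speaker case.

```python
import numpy as np


def cosine_distance_matrix(x):
    """Pairwise cosine distances between the rows of x."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return 1.0 - x @ x.T


def cosine_distances_to(v, others):
    """Cosine distances from vector v to each row of others."""
    v = v / np.linalg.norm(v)
    others = others / np.linalg.norm(others, axis=1, keepdims=True)
    return 1.0 - others @ v


def assign_faces(speech_emb, face_candidates, n_sweeps=10):
    """Pick one candidate face per speech segment (hypothetical helper).

    speech_emb: (N, d_s) array, one speaker embedding per speech segment.
    face_candidates: list of N arrays, each (k_i, d_f), holding the
        embeddings of the faces visible during segment i.
    Returns a list of N chosen candidate indices.
    """
    n = len(face_candidates)
    speech_dist = cosine_distance_matrix(speech_emb)
    choice = [0] * n  # assumption: start from the first candidate everywhere
    for _ in range(n_sweeps):
        changed = False
        for i in range(n):
            others = [j for j in range(n) if j != i]
            chosen = np.stack([face_candidates[j][choice[j]] for j in others])
            target = speech_dist[i, others]
            # Score each candidate by how closely its distances to the other
            # chosen faces match segment i's speech distances (assumed score).
            scores = [
                -np.sum((cosine_distances_to(c, chosen) - target) ** 2)
                for c in face_candidates[i]
            ]
            best = int(np.argmax(scores))
            if best != choice[i]:
                choice[i] = best
                changed = True
        if not changed:
            break
    return choice


# Toy data: two characters, A and B, speaking in the order A, B, A, B.
voice_a, voice_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
face_a, face_b = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
speech = np.stack([voice_a, voice_b, voice_a, voice_b])
candidates = [
    np.stack([face_a]),          # only A on screen
    np.stack([face_b]),          # only B on screen
    np.stack([face_b, face_a]),  # both on screen, A is speaking
    np.stack([face_a, face_b]),  # both on screen, B is speaking
]
choice = assign_faces(speech, candidates)
print(choice)
```

In the toy data, the two single-face segments anchor the identity structure. Without such anchors, swapping the two faces everywhere would match the speech distance structure equally well, so a purely relative criterion like this one cannot tell the two solutions apart.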

Comments:	Under review at IEEE Transactions on Image Processing
Subjects:	Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2209.11896 [eess.IV]
	(or arXiv:2209.11896v1 [eess.IV] for this version)
	https://doi.org/10.48550/arXiv.2209.11896

Submission history

From: Rahul Sharma
[v1] Sat, 24 Sep 2022 00:51:38 UTC (6,397 KB)
