Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

Owens, Andrew; Efros, Alexei A.

arXiv:1804.03641 (cs)

[Submitted on 10 Apr 2018 (v1), last revised 9 Oct 2018 (this version, v2)]

Abstract: The thud of a bouncing ball, the onset of speech as lips open -- when visual and audio events occur together, it suggests that there might be a common, underlying event that produced both signals. In this paper, we argue that the visual and audio components of a video signal should be modeled jointly using a fused multisensory representation. We propose to learn such a representation in a self-supervised way, by training a neural network to predict whether video frames and audio are temporally aligned. We use this learned representation for three applications: (a) sound source localization, i.e. visualizing the source of sound in a video; (b) audio-visual action recognition; and (c) on/off-screen audio source separation, e.g. removing the off-screen translator's voice from a foreign official's speech. Code, models, and video results are available on our webpage: this http URL
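
The alignment pretext task described in the abstract needs labeled pairs of video frames and audio. One way such pairs could be built from a single unlabeled video is sketched below. This is a minimal illustration, not the authors' code: the function name `make_alignment_examples`, its parameters, and the choice to create the misaligned example by taking audio from a different time offset in the same video are all assumptions, since the abstract does not specify how misaligned examples are produced.

```python
import random
from typing import List, Sequence, Tuple

# (video frame indices, audio samples, label); label 1 = aligned, 0 = misaligned.
Example = Tuple[List[int], List[float], int]


def make_alignment_examples(
    audio: Sequence[float],
    num_frames: int,
    samples_per_frame: int,
    clip_frames: int,
    min_shift_frames: int,
    rng: random.Random,
) -> Tuple[Example, Example]:
    """Build one aligned and one misaligned training example (hypothetical helper).

    Both examples share the same video frames. The aligned example uses the
    audio recorded during those frames; the misaligned one uses audio that
    starts at least `min_shift_frames` earlier or later (an assumption).
    """
    max_start = num_frames - clip_frames
    if min_shift_frames < 1:
        raise ValueError("min_shift_frames must be at least 1")
    if len(audio) < num_frames * samples_per_frame:
        raise ValueError("audio track is shorter than the video")
    if max_start < 2 * min_shift_frames:
        # Guarantees that every clip start has at least one valid shifted start.
        raise ValueError("video too short for the requested clip and shift")

    def audio_clip(start_frame: int) -> List[float]:
        lo = start_frame * samples_per_frame
        hi = (start_frame + clip_frames) * samples_per_frame
        return list(audio[lo:hi])

    start = rng.randint(0, max_start)  # inclusive on both ends
    frames = list(range(start, start + clip_frames))

    shifted_starts = [
        s for s in range(max_start + 1) if abs(s - start) >= min_shift_frames
    ]
    shifted = rng.choice(shifted_starts)

    aligned: Example = (frames, audio_clip(start), 1)
    misaligned: Example = (frames, audio_clip(shifted), 0)
    return aligned, misaligned


if __name__ == "__main__":
    # Toy track: 20 frames, 4 audio samples per frame, sample value = sample index.
    toy_audio = [float(i) for i in range(80)]
    pos, neg = make_alignment_examples(toy_audio, 20, 4, 5, 3, random.Random(0))
    print("frames:", pos[0])
    print("aligned audio starts at sample", int(pos[1][0]))
    print("misaligned audio starts at sample", int(neg[1][0]))
```

A binary classifier trained on such pairs never sees a human label: the supervision comes entirely from the video's own timing. Because both examples show the same frames, the only cue that separates them is whether the audio matches what is on screen.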

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:1804.03641 [cs.CV]
	(or arXiv:1804.03641v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.1804.03641

Submission history

From: Andrew Owens
[v1] Tue, 10 Apr 2018 17:36:50 UTC (7,742 KB)
[v2] Tue, 9 Oct 2018 07:15:29 UTC (9,427 KB)
