Leveraging the Video-level Semantic Consistency of Event for Audio-visual Event Localization

Jiang, Yuanyuan; Yin, Jianqin; Dang, Yonghao

doi:10.1109/TMM.2023.3324498

arXiv:2210.05242 (cs)

[Submitted on 11 Oct 2022 (v1), last revised 20 Oct 2023 (this version, v2)]

Abstract: Audio-visual event (AVE) localization has attracted much attention in recent years. Most existing methods encode and classify each video segment independently, in isolation from the full video (which can be regarded as the segment-level representations of events). However, they ignore the semantic consistency of an event within the same full video (which can be regarded as the video-level representation of the event). In contrast to existing methods, we propose a novel video-level semantic consistency guidance network for the AVE localization task. Specifically, we propose an event semantic consistency modeling (ESCM) module that explores video-level semantic information for semantic consistency modeling. It consists of two components: a cross-modal event representation extractor (CERE) and an intra-modal semantic consistency enhancer (ISCE). CERE obtains the event semantic information at the video level. ISCE then takes the video-level event semantics as prior knowledge to guide the model to focus on the semantic continuity of an event within each modality. Moreover, we propose a new negative pair filter loss that encourages the network to filter out irrelevant segment pairs, and a new smooth loss that further increases the gap between different categories of events in the weakly-supervised setting. We perform extensive experiments on the public AVE dataset and outperform the state-of-the-art methods in both fully- and weakly-supervised settings, thus verifying the effectiveness of our method. The code is available at this https URL.

Comments:	13 pages, 10 figures, Accepted by IEEE Transactions on Multimedia
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2210.05242 [cs.CV]
	(or arXiv:2210.05242v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2210.05242
Related DOI:	https://doi.org/10.1109/TMM.2023.3324498

Submission history

From: Yuanyuan Jiang
[v1] Tue, 11 Oct 2022 08:15:57 UTC (2,569 KB)
[v2] Fri, 20 Oct 2023 08:48:11 UTC (3,829 KB)
