Disentangle and denoise: Tackling context misalignment for video moment retrieval

Ma, Kaijing; Fang, Han; Zang, Xianghao; Ban, Chao; Zhou, Lanxiang; He, Zhongjiang; Li, Yongxiang; Sun, Hao; Feng, Zerun; Hou, Xingsong

arXiv:2408.07600 (cs)

[Submitted on 14 Aug 2024]

Abstract: Video Moment Retrieval, which aims to locate in-context video moments according to a natural language query, is an essential task for cross-modal grounding. Existing methods focus on enhancing the cross-modal interactions between all moments and the textual description for video understanding. However, constantly interacting with all locations is unreasonable because of the uneven semantic distribution across the timeline and noisy visual backgrounds. This paper proposes a cross-modal Context Denoising Network (CDNet) for accurate moment retrieval by disentangling complex correlations and denoising irrelevant dynamics. Specifically, we propose a query-guided semantic disentanglement (QSD) module that decouples video moments by estimating their alignment levels from global and fine-grained correlations. A Context-aware Dynamic Denoisement (CDD) module is proposed to enhance the understanding of aligned spatial-temporal details by learning a group of query-relevant offsets. Extensive experiments on public benchmarks demonstrate that the proposed CDNet achieves state-of-the-art performance.
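The abstract describes QSD and CDD only at a high level, so the following is a minimal NumPy sketch of the two generic ideas it names, under stated assumptions, and not CDNet's actual implementation. (1) Each clip gets an alignment score that mixes a global similarity (clip vs. sentence embedding) with a fine-grained one (clip vs. its best-matching word), and clips are then split into aligned and misaligned sets by a threshold. (2) Clip features are read at fractional temporal offsets around a center with linear interpolation, in the spirit of deformable sampling. The function names, the mixing weight `alpha`, the threshold, and the fixed offsets (which the paper says are learned from the query) are all hypothetical.

```python
import numpy as np


def cosine_sim(a: np.ndarray, b: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Pairwise cosine similarity between rows of a (T, D) and b (N, D) -> (T, N)."""
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return a @ b.T


def alignment_levels(clips, words, sentence, alpha=0.5):
    """Hypothetical per-clip alignment score (not the paper's formula).

    clips: (T, D) clip features; words: (N, D) word features;
    sentence: (D,) global query feature. Mixes a global score
    (clip vs. sentence) with a fine-grained score (clip vs. its
    best-matching word) using the assumed weight alpha.
    """
    global_sim = cosine_sim(clips, sentence[None, :])[:, 0]  # (T,)
    fine_sim = cosine_sim(clips, words).max(axis=1)          # (T,)
    return alpha * global_sim + (1.0 - alpha) * fine_sim


def split_by_alignment(scores, threshold=0.5):
    """Indices of clips at or above the assumed threshold vs. below it."""
    return np.flatnonzero(scores >= threshold), np.flatnonzero(scores < threshold)


def sample_with_offsets(features, centers, offsets):
    """Read clip features at fractional positions centers + offsets.

    features: (T, D); centers: (K,); offsets: (K, M) -> (K, M, D).
    Positions are clamped to [0, T - 1] and linearly interpolated between
    neighbouring clips. In a trained model the offsets would be predicted
    from the query; in this sketch they are passed in directly.
    """
    T = features.shape[0]
    pos = np.clip(centers[:, None] + offsets, 0, T - 1)
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, T - 1)
    w = (pos - lo)[..., None]  # interpolation weight toward the upper clip
    return (1.0 - w) * features[lo] + w * features[hi]


# Toy usage with 2-D features.
clips = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
words = np.array([[1.0, 0.0], [0.5, 0.5]])
sentence = np.array([1.0, 0.0])
scores = alignment_levels(clips, words, sentence)
aligned, misaligned = split_by_alignment(scores)
# aligned -> [0, 2], misaligned -> [1]

feats = np.arange(5.0)[:, None].repeat(2, axis=1)  # clip t has feature [t, t]
sampled = sample_with_offsets(feats, np.array([2.0]), np.array([[-0.5, 0.5, 10.0]]))
# sampled[0, :, 0] -> [1.5, 2.5, 4.0] (the last position is clamped to clip 4)
```

Taking the maximum over words is just one simple way to capture a fine-grained match; attention-weighted pooling over words would be a common alternative, and the paper may use neither.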

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2408.07600 [cs.CV]
	(or arXiv:2408.07600v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2408.07600

Submission history

From: Kaijing Ma
[v1] Wed, 14 Aug 2024 15:00:27 UTC (10,636 KB)
