Multi-modal Fusion and Query Refinement Network for Video Moment Retrieval and Highlight Detection

Xu, Yifang; Sun, Yunzhuo; Zhai, Benxiang; Xie, Zien; Jia, Youyao; Du, Sidan

doi:10.1109/ICME57554.2024.10687844

arXiv:2501.10692 (cs)

[Submitted on 18 Jan 2025]

Abstract: Given a video and a linguistic query, video moment retrieval and highlight detection (MR&HD) aim to locate all relevant spans while simultaneously predicting saliency scores. Most existing methods use RGB images as input, overlooking inherent multi-modal visual signals such as optical flow and depth. In this paper, we propose a Multi-modal Fusion and Query Refinement Network (MRNet) to learn complementary information from multi-modal cues. Specifically, we design a multi-modal fusion module that dynamically combines RGB, optical flow, and depth maps. Furthermore, to simulate human understanding of sentences, we introduce a query refinement module that merges text at different granularities: the word, phrase, and sentence levels. Comprehensive experiments on the QVHighlights and Charades datasets indicate that MRNet outperforms current state-of-the-art methods, achieving notable improvements in MR-mAP@Avg (+3.41) and HD-HIT@1 (+3.46) on QVHighlights.
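
The abstract says the fusion module "dynamically" combines RGB, optical flow, and depth, but does not specify the mechanism. As a rough illustration only, the sketch below assumes a softmax-gated weighted sum over per-clip feature vectors. The function name `fuse_modalities`, the `gate_weights` parameter, and the dot-product gating are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_modalities(rgb, flow, depth, gate_weights):
    """Fuse three equal-length feature vectors with input-dependent weights.

    gate_weights is a hypothetical stand-in for learned parameters: one
    gating vector per modality, each the same length as the features.
    This is an assumed mechanism, not the module described in the paper.
    """
    feats = [rgb, flow, depth]
    # One relevance score per modality: dot product of feature and gate vector.
    scores = [sum(g * x for g, x in zip(gate, feat))
              for gate, feat in zip(gate_weights, feats)]
    # Normalize the scores so the modality weights sum to 1.
    alphas = softmax(scores)
    # Weighted sum of the modality features, dimension by dimension.
    fused = [sum(a * feat[i] for a, feat in zip(alphas, feats))
             for i in range(len(rgb))]
    return fused, alphas


# Toy inputs: with all-zero gates every modality receives the same weight.
rgb, flow, depth = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
zero_gates = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
fused, alphas = fuse_modalities(rgb, flow, depth, zero_gates)
```

Because the weights are computed from the inputs themselves, a clip with strong motion could in principle lean more on optical flow than a static clip would. Whether MRNet gates per clip, per channel, or through attention is not stated in the abstract.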

Comments:	Accepted by ICME 2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2501.10692 [cs.CV]
	(or arXiv:2501.10692v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.10692
Related DOI:	https://doi.org/10.1109/ICME57554.2024.10687844

Submission history

From: Yifang Xu
[v1] Sat, 18 Jan 2025 08:09:44 UTC (2,231 KB)
