Unsupervised Temporal Video Grounding with Deep Semantic Clustering

Liu, Daizong; Qu, Xiaoye; Wang, Yinzhen; Di, Xing; Zou, Kai; Cheng, Yu; Xu, Zichuan; Zhou, Pan

arXiv:2201.05307 (cs)

[Submitted on 14 Jan 2022]

Abstract: Temporal video grounding (TVG) aims to localize a target segment in a video according to a given sentence query. Although prior work has made substantial progress on this task, it relies heavily on abundant video-query paired data, which is expensive and time-consuming to collect in real-world scenarios. In this paper, we explore whether a video grounding model can be learned without any paired annotations. To the best of our knowledge, this is the first work to address TVG in an unsupervised setting. Since no paired supervision is available, we propose a novel Deep Semantic Clustering Network (DSCNet) that leverages the semantic information of the whole query set to compose the possible activity in each video for grounding. Specifically, we first develop a language semantic mining module, which extracts implicit semantic features from the whole query set. These language semantic features then serve as guidance to compose the activity in each video via a video-based semantic aggregation module. Finally, we use a foreground attention branch to filter out redundant background activities and refine the grounding results. To validate the effectiveness of DSCNet, we conduct experiments on the ActivityNet Captions and Charades-STA datasets. The results demonstrate that DSCNet achieves competitive performance and even outperforms most weakly-supervised approaches.
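
The three stages named in the abstract can be illustrated with a toy sketch. This is not the authors' DSCNet: the abstract gives no implementation details, so every choice here is an assumption made for illustration. Plain k-means with cosine similarity stands in for language semantic mining, a per-frame maximum similarity to the cluster centers stands in for video-based semantic aggregation, and a fixed threshold stands in for the foreground attention branch. The function names (`kmeans`, `frame_scores`, `ground`) and the 2-D feature vectors are hypothetical.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def mean(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def kmeans(points, k, iters=20, seed=0):
    """ASSUMPTION: k-means over query features as a stand-in for
    'language semantic mining' on the whole (unpaired) query set."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            best = max(range(k), key=lambda j: cosine(p, centers[j]))
            groups[best].append(p)
        # Keep the old center when a cluster receives no points.
        centers = [mean(g) if g else centers[j] for j, g in enumerate(groups)]
    return centers


def frame_scores(frames, centers):
    """ASSUMPTION: score each frame by its best similarity to any
    cluster center, as a stand-in for 'video-based semantic aggregation'."""
    return [max(cosine(f, c) for c in centers) for f in frames]


def ground(scores, threshold):
    """ASSUMPTION: treat frames scoring above `threshold` as foreground and
    return the longest such run as (start, end), end exclusive.
    Returns None if no frame qualifies; the earliest run wins ties."""
    best, start = None, None
    # A -inf sentinel closes a run that extends to the last frame.
    for i, s in enumerate(scores + [float("-inf")]):
        if s > threshold and start is None:
            start = i
        elif s <= threshold and start is not None:
            if best is None or i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best


# Toy data: four query features and six frame features (all hypothetical).
queries = [[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.9]]
bg, fg = [-1.0, -1.0], [1.0, 0.0]
frames = [bg, bg, fg, fg, fg, bg]

centers = kmeans(queries, k=2)
scores = frame_scores(frames, centers)
print(ground(scores, threshold=0.5))  # (2, 5)
```

Note that no query is ever paired with a video in this sketch: the cluster centers are computed from the query set alone, and each video is scored against all of them, which mirrors the unpaired setting the abstract describes.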

Comments:	Accepted by AAAI 2022
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2201.05307 [cs.CV]
	(or arXiv:2201.05307v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2201.05307

Submission history

From: Daizong Liu
[v1] Fri, 14 Jan 2022 05:16:33 UTC (4,943 KB)
