Multi-Context Temporal Consistent Modeling for Referring Video Object Segmentation

Choi, Sun-Hyuk; Jo, Hayoung; Lee, Seong-Whan


arXiv:2501.04939 (cs)

[Submitted on 9 Jan 2025]

Abstract: Referring video object segmentation aims to segment the objects in a video that correspond to a given text description. Existing transformer-based temporal modeling approaches face two challenges: query inconsistency and limited consideration of context. Query inconsistency produces unstable masks that cover different objects partway through the video. Limited consideration of context leads to the segmentation of incorrect objects, because the relationship between the given text and the instances is not adequately taken into account. To address these issues, we propose the Multi-context Temporal Consistency Module (MTCM), which consists of an Aligner and a Multi-Context Enhancer (MCE). The Aligner removes noise from queries and aligns them to achieve query consistency. The MCE predicts text-relevant queries by considering multiple contexts. We applied MTCM to four different models and improved the performance of all of them, notably achieving 47.6 J&F on MeViS. Code is available at this https URL.
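The abstract reports its result in J&F without defining the metric. In video object segmentation benchmarks, J&F commonly denotes the mean of region similarity J (the Jaccard index, i.e. intersection over union of predicted and ground-truth masks) and contour accuracy F (a boundary F-measure), usually reported as a percentage. The sketch below assumes that convention and is not taken from the paper's code. It computes J for a toy pair of masks and averages it with an F value supplied by the caller. The names `region_similarity` and `j_and_f` are hypothetical, and the boundary-based F computation is deliberately left out.

```python
from typing import List

# Binary mask as nested lists: 1 = object pixel, 0 = background.
Mask = List[List[int]]


def region_similarity(pred: Mask, gt: Mask) -> float:
    """Region similarity J: intersection over union of two binary masks."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Assumption: two empty masks are treated as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j: float, f: float) -> float:
    """J&F as the plain average of J and F (assumed convention)."""
    return (j + f) / 2.0


# Toy example: 2 pixels overlap, 4 pixels are covered by either mask.
pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]

j = region_similarity(pred, gt)
# The F value here is a made-up placeholder, not a measured result.
score = j_and_f(j, 0.75)
```

In practice both J and F are computed per frame and per object and then averaged over the dataset, so a reported score such as 47.6 corresponds to 0.476 on the 0 to 1 scale used in this sketch.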

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2501.04939 [cs.CV]
	(or arXiv:2501.04939v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.04939

Submission history

From: Sun-Hyuk Choi
[v1] Thu, 9 Jan 2025 03:04:08 UTC (3,694 KB)
