4th PVUW MeViS 3rd Place Report: Sa2VA

Yuan, Haobo; Zhang, Tao; Li, Xiangtai; Qi, Lu; Huang, Zilong; Xu, Shilin; Feng, Jiashi; Yang, Ming-Hsuan

arXiv:2504.00476 (cs)

[Submitted on 1 Apr 2025]

Abstract: Referring video object segmentation (RVOS) is a challenging task that requires a model to segment the object in a video described by a language expression. MeViS is a recently proposed dataset whose expressions describe the motion of the target objects, making it a more challenging benchmark than existing RVOS benchmarks. Meanwhile, a new trend in referring expression tasks is to adopt multi-modal large language models (MLLMs) to achieve better image and text alignment. In this report, we show that a simple modification to the test-time inference method on stronger MLLMs yields stronger results on MeViS. In particular, we adopt the recent method Sa2VA, a unified model for dense grounded understanding of both images and videos. By enlarging the scope of key frames, without any further training, we achieve 3rd place in the 4th PVUW workshop.
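
To make "enlarging the scope of key frames" concrete, here is a minimal sketch of one way such a test-time change could look. It is an illustration under assumptions, not the authors' implementation: the abstract does not say how key frames are chosen, so both the baseline (taking the first k frames) and the alternative (spreading k frames evenly over the whole clip) are hypothetical, as are the function names first_k_key_frames and uniform_key_frames.

```python
from typing import List


def first_k_key_frames(num_frames: int, k: int) -> List[int]:
    """Assumed baseline: use the first k frames of the clip as key frames."""
    if num_frames <= 0 or k <= 0:
        return []
    return list(range(min(k, num_frames)))


def uniform_key_frames(num_frames: int, k: int) -> List[int]:
    """Hypothetical wider scope: spread k key frames evenly over the clip."""
    if num_frames <= 0 or k <= 0:
        return []
    k = min(k, num_frames)  # cannot pick more key frames than frames
    if k == 1:
        return [0]
    # Integer arithmetic gives evenly spaced, strictly increasing indices
    # that run from the first frame to the last frame, inclusive.
    return [(i * (num_frames - 1)) // (k - 1) for i in range(k)]


if __name__ == "__main__":
    print(first_k_key_frames(100, 5))
    print(uniform_key_frames(100, 5))
```

Because only the frame indices passed to the model change, a strategy of this kind needs no retraining, which is consistent with the abstract's "without any further training". Motion expressions often refer to events late in a video, so key frames drawn from the whole clip are more likely to include the moment being described.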

Comments:	Technical Report, 4 pages, Code: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2504.00476 [cs.CV]
	(or arXiv:2504.00476v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2504.00476

Submission history

From: Haobo Yuan
[v1] Tue, 1 Apr 2025 07:06:47 UTC (4,154 KB)