InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models

Wei, Cong; Zhong, Yujie; Tan, Haoxian; Zeng, Yingsen; Liu, Yong; Zhao, Zheng; Yang, Yujiu

arXiv:2412.14006 (cs)

[Submitted on 18 Dec 2024]

Abstract: Boosted by Multi-modal Large Language Models (MLLMs), text-guided universal segmentation models for the image and video domains have made rapid progress recently. However, these methods are often developed separately for specific domains, overlooking the similarities in task settings and solutions across these two areas. In this paper, we define the union of referring segmentation and reasoning segmentation at both the image and video levels as Instructed Visual Segmentation (IVS). Correspondingly, we propose InstructSeg, an end-to-end segmentation pipeline equipped with MLLMs for IVS. Specifically, we employ an object-aware video perceiver to extract temporal and object information from reference frames, facilitating comprehensive video understanding. Additionally, we introduce vision-guided multi-granularity text fusion to better integrate global and detailed text information with fine-grained visual guidance. By leveraging multi-task and end-to-end training, InstructSeg demonstrates superior performance across diverse image and video segmentation tasks, surpassing both segmentation specialists and MLLM-based methods with a single model. Our code is available at this https URL.

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2412.14006 [cs.CV]
	(or arXiv:2412.14006v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2412.14006

Submission history

From: Cong Wei
[v1] Wed, 18 Dec 2024 16:20:40 UTC (8,790 KB)
