Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection

Long, Yanxin; Han, Jianhua; Huang, Runhui; Hang, Xu; Zhu, Yi; Xu, Chunjing; Liang, Xiaodan

doi:10.1109/TNNLS.2023.3293484


arXiv:2211.00849 (cs)

[Submitted on 2 Nov 2022 (v1), last revised 29 Jul 2023 (this version, v2)]



Abstract: Inspired by the success of vision-language methods (VLMs) in zero-shot classification, recent works extend this line of work to object detection by leveraging the localization ability of pre-trained VLMs and generating pseudo labels for unseen classes in a self-training manner. However, because current VLMs are usually pre-trained by aligning sentence embeddings with global image embeddings, using them directly lacks the fine-grained alignment for object instances that is the core of detection. In this paper, we propose a simple but effective fine-grained Visual-Text Prompt-driven self-training paradigm for Open-Vocabulary Detection (VTP-OVD), which introduces a fine-grained visual-text prompt adapting stage to strengthen the current self-training paradigm with more powerful fine-grained alignment. During the adapting stage, we enable the VLM to obtain fine-grained alignment by using learnable text prompts to solve an auxiliary dense pixel-wise prediction task. Furthermore, we propose a visual prompt module that provides prior task information (i.e., the categories that need to be predicted) to the vision branch, to better adapt the pre-trained VLM to downstream tasks. Experiments show that our method achieves state-of-the-art performance for open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.
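The dense pixel-wise prediction idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that per-pixel visual features and per-class text embeddings already exist in a shared space, and it simply labels each pixel with the class whose text embedding has the highest cosine similarity. The function names (`pixel_text_scores`, `dense_prediction`) and the toy two-dimensional vectors are hypothetical; in the paper the text embeddings would come from learnable prompts passed through a VLM text encoder, which is replaced here by fixed vectors.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def pixel_text_scores(pixel_feats, text_embeds):
    """Cosine similarity between every pixel feature and every class embedding.

    pixel_feats: H x W x D nested lists (hypothetical per-pixel visual features).
    text_embeds: K x D nested lists (stand-ins for prompt-derived text embeddings).
    Returns an H x W x K nested list of similarity scores.
    """
    text_n = [l2_normalize(t) for t in text_embeds]
    scores = []
    for row in pixel_feats:
        out_row = []
        for feat in row:
            f = l2_normalize(feat)
            out_row.append([sum(a * b for a, b in zip(f, t)) for t in text_n])
        scores.append(out_row)
    return scores


def dense_prediction(scores):
    """Assign each pixel the index of its highest-scoring class."""
    return [[max(range(len(s)), key=s.__getitem__) for s in row] for row in scores]


# Toy 2 x 2 "image" with 2-dimensional features and two classes.
pixel_feats = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.2, 0.8]],
]
text_embeds = [[1.0, 0.0], [0.0, 1.0]]

scores = pixel_text_scores(pixel_feats, text_embeds)
labels = dense_prediction(scores)
print(labels)
```

In a real adapting stage the score map would feed a pixel-wise loss whose gradients update the prompt parameters; the sketch covers only the forward scoring step.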

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2211.00849 [cs.CV]
	(or arXiv:2211.00849v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2211.00849
Related DOI:	https://doi.org/10.1109/TNNLS.2023.3293484

Submission history

From: Yanxin Long
[v1] Wed, 2 Nov 2022 03:38:02 UTC (43,801 KB)
[v2] Sat, 29 Jul 2023 17:46:25 UTC (46,951 KB)
