An Open and Comprehensive Pipeline for Unified Object Grounding and Detection

Zhao, Xiangyu; Chen, Yicheng; Xu, Shilin; Li, Xiangtai; Wang, Xinjiang; Li, Yining; Huang, Haian

arXiv:2401.02361 (cs)

[Submitted on 4 Jan 2024 (v1), last revised 5 Jan 2024 (this version, v2)]

Abstract: Grounding-DINO is a state-of-the-art open-set detection model that tackles multiple vision tasks, including Open-Vocabulary Detection (OVD), Phrase Grounding (PG), and Referring Expression Comprehension (REC). Its effectiveness has led to its widespread adoption as a mainstream architecture for various downstream applications. However, despite its significance, the original Grounding-DINO model lacks comprehensive public technical details because its training code is unavailable. To bridge this gap, we present MM-Grounding-DINO, an open-source, comprehensive, and user-friendly baseline built with the MMDetection toolbox. It adopts abundant vision datasets for pre-training and various detection and grounding datasets for fine-tuning. We give a comprehensive analysis of each reported result and detailed settings for reproduction. Extensive experiments on the benchmarks mentioned demonstrate that our MM-Grounding-DINO-Tiny outperforms the Grounding-DINO-Tiny baseline. We release all our models to the research community. Code and trained models are released at this https URL.

Comments:	10 pages, 6 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2401.02361 [cs.CV]
	(or arXiv:2401.02361v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2401.02361

Submission history

From: Yicheng Chen
[v1] Thu, 4 Jan 2024 17:00:49 UTC (14,407 KB)
[v2] Fri, 5 Jan 2024 06:21:19 UTC (14,407 KB)
