Revisiting Few-Shot Object Detection with Vision-Language Models

Madan, Anish; Peri, Neehar; Kong, Shu; Ramanan, Deva

arXiv:2312.14494 (cs)

[Submitted on 22 Dec 2023 (v1), last revised 14 Oct 2024 (this version, v4)]

Abstract: The era of vision-language models (VLMs) trained on web-scale datasets challenges conventional formulations of "open-world" perception. In this work, we revisit the task of few-shot object detection (FSOD) in the context of recent foundational VLMs. First, we point out that zero-shot predictions from VLMs such as GroundingDINO significantly outperform state-of-the-art few-shot detectors (48 vs. 33 AP) on COCO. Despite their strong zero-shot performance, such foundation models may still be sub-optimal. For example, trucks on the web may be defined differently from trucks for a target application such as autonomous vehicle perception. We argue that the task of few-shot recognition can be reformulated as aligning foundation models to target concepts using a few examples. Interestingly, such examples can be multi-modal, using both text and visual cues, mimicking instructions that are often given to human annotators when defining a target concept of interest. Concretely, we propose Foundational FSOD, a new benchmark protocol that evaluates detectors pre-trained on any external data and fine-tuned on multi-modal (text and visual) K-shot examples per target class. We repurpose nuImages for Foundational FSOD, benchmark several popular open-source VLMs, and provide an empirical analysis of state-of-the-art methods. Lastly, we discuss our recent CVPR 2024 Foundational FSOD competition and share insights from the community. Notably, the winning team significantly outperforms our baseline by 23.3 mAP! Our code and dataset splits are available at this https URL
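
The K-shot setup mentioned in the abstract can be pictured with a small sketch. This is not the authors' split-generation code: the helper name `sample_k_shot`, the `(image_id, class_name)` annotation format, and the toy data are all assumptions made for illustration. It only shows the general idea of drawing at most K visual examples per target class; real detection data complicates this because a single image can contain instances of several classes.

```python
import random
from collections import defaultdict


def sample_k_shot(annotations, k, seed=0):
    """Hypothetical helper: pick up to k example images per class.

    annotations: list of (image_id, class_name) pairs (assumed format).
    Returns a dict mapping class_name -> list of at most k image ids.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible

    # Group the annotated image ids by their class label.
    by_class = defaultdict(list)
    for image_id, class_name in annotations:
        by_class[class_name].append(image_id)

    # Shuffle each class's candidates and keep the first k.
    support = {}
    for class_name, image_ids in sorted(by_class.items()):
        shuffled = image_ids[:]
        rng.shuffle(shuffled)
        support[class_name] = shuffled[:k]
    return support


# Toy annotations, invented for this example only.
toy = [
    ("img1", "truck"),
    ("img2", "truck"),
    ("img3", "truck"),
    ("img4", "car"),
    ("img5", "car"),
]
support = sample_k_shot(toy, k=2)
print(support)
```

In the protocol described above, each class would also come with a text description alongside these visual examples; that part is omitted here because the abstract does not specify its form.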

Comments:	The first two authors contributed equally. This work has been accepted to the Neural Information Processing Systems (NeurIPS) 2024 Datasets & Benchmark Track
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2312.14494 [cs.CV]
	(or arXiv:2312.14494v4 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2312.14494

Submission history

From: Neehar Peri
[v1] Fri, 22 Dec 2023 07:42:00 UTC (15,193 KB)
[v2] Sat, 20 Apr 2024 22:00:41 UTC (15,191 KB)
[v3] Fri, 14 Jun 2024 14:09:29 UTC (23,076 KB)
[v4] Mon, 14 Oct 2024 16:44:44 UTC (23,080 KB)
