Multi-Modal Classifiers for Open-Vocabulary Object Detection

Kaul, Prannay; Xie, Weidi; Zisserman, Andrew

arXiv:2306.05493 (cs)

[Submitted on 8 Jun 2023]

Abstract: The goal of this paper is open-vocabulary object detection (OVOD): building a model that can detect objects beyond the set of categories seen at training, thus enabling the user to specify categories of interest at inference without the need for model retraining. We adopt a standard two-stage object detector architecture, and explore three ways of specifying novel categories: via language descriptions, via image exemplars, or via a combination of the two. We make three contributions: first, we prompt a large language model (LLM) to generate informative language descriptions for object classes, and construct powerful text-based classifiers; second, we employ a visual aggregator on image exemplars that can ingest any number of images as input, forming vision-based classifiers; and third, we provide a simple method to fuse information from language descriptions and image exemplars, yielding a multi-modal classifier. When evaluating on the challenging LVIS open-vocabulary benchmark we demonstrate that: (i) our text-based classifiers outperform all previous OVOD works; (ii) our vision-based classifiers perform as well as text-based classifiers in prior work; (iii) multi-modal classifiers perform better than either modality alone; and finally, (iv) our text-based and multi-modal classifiers yield better performance than a fully-supervised detector.
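
The three ways of specifying a category described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not give the form of the visual aggregator or the fusion rule, so the sketch assumes plain mean pooling in place of the aggregator and assumes "add the two unit vectors and renormalize" as the fusion step. All function names and the three-dimensional embeddings are hypothetical stand-ins for the outputs of a real text or image encoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def mean_embedding(embeddings):
    """Average any number of same-length embeddings, then L2-normalize.

    Assumption: mean pooling stands in for the paper's visual aggregator
    and for however the text descriptions are combined.
    """
    if not embeddings:
        raise ValueError("need at least one embedding")
    dim = len(embeddings[0])
    count = len(embeddings)
    return l2_normalize([sum(e[i] for e in embeddings) / count for i in range(dim)])


def fuse(text_classifier, vision_classifier):
    """Hypothetical fusion rule: add the two unit vectors and renormalize."""
    return l2_normalize([t + v for t, v in zip(text_classifier, vision_classifier)])


def classify(region_embedding, classifiers):
    """Return the best category and all cosine-similarity scores."""
    region = l2_normalize(region_embedding)
    scores = {
        name: sum(r * w for r, w in zip(region, weights))
        for name, weights in classifiers.items()
    }
    return max(scores, key=scores.get), scores


# Toy embeddings (made up): several text descriptions and image exemplars per class.
text_embeddings = {
    "cat": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "dog": [[0.0, 1.0, 0.0]],
}
exemplar_embeddings = {
    "cat": [[0.9, 0.1, 0.1]],
    "dog": [[0.1, 0.9, 0.0], [0.0, 1.0, 0.2]],
}

multi_modal = {
    name: fuse(mean_embedding(text_embeddings[name]),
               mean_embedding(exemplar_embeddings[name]))
    for name in text_embeddings
}

label, scores = classify([0.9, 0.2, 0.05], multi_modal)
print(label, scores)
```

Because each classifier is just a unit vector per category, a new category can be added at inference by computing one more vector, with no detector retraining; that is the property the abstract emphasizes.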

Comments:	ICML 2023, project page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
ACM classes:	I.4.6; I.4.8; I.4.9; I.2.10
Cite as:	arXiv:2306.05493 [cs.CV]
	(or arXiv:2306.05493v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2306.05493

Submission history

From: Prannay Kaul
[v1] Thu, 8 Jun 2023 18:31:56 UTC (12,044 KB)
