African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification

Geigle, Gregor; Timofte, Radu; Glavaš, Goran


arXiv:2406.14496 (cs)

[Submitted on 20 Jun 2024]



Abstract: Recent Large Vision-Language Models (LVLMs) demonstrate impressive abilities on numerous image understanding and reasoning tasks. The task of fine-grained object classification (e.g., distinguishing between animal species), however, has been probed insufficiently, despite its downstream importance. We fill this evaluation gap by creating FOCI (Fine-grained Object ClassIfication), a difficult multiple-choice benchmark for fine-grained object classification, built from existing object classification datasets: (1) the multiple-choice format avoids the ambiguous answers associated with casting classification as an open-ended QA task; (2) we retain classification difficulty by mining negative labels with a CLIP model. FOCI complements five popular classification datasets with four domain-specific subsets from ImageNet-21k. We benchmark 12 public LVLMs on FOCI and show that it tests for a skill complementary to established image understanding and reasoning benchmarks. Crucially, CLIP models exhibit dramatically better performance than LVLMs. Since the image encoders of LVLMs come from these CLIP models, this points to inadequate alignment for fine-grained object distinction between the encoder and the LLM and warrants (pre)training data with more fine-grained annotation. We release our code at this https URL.
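The abstract's second design point, mining negative labels with a CLIP model, can be illustrated with a minimal sketch. This is not the authors' code: the abstract does not specify the selection rule, so the sketch assumes the distractors are the k non-gold labels whose embeddings are most similar (by cosine similarity) to the image embedding. The function name `mine_choices`, the parameter `k`, and the toy three-dimensional vectors are all hypothetical stand-ins for real CLIP image and text embeddings.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mine_choices(image_emb, label_embs, gold, k=3, seed=0):
    """Build one multiple-choice item (hypothetical helper).

    Assumption: hard negatives are the k non-gold labels whose
    embeddings score highest against the image embedding.
    """
    scored = [
        (cosine(image_emb, emb), label)
        for label, emb in label_embs.items()
        if label != gold
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    negatives = [label for _, label in scored[:k]]
    choices = negatives + [gold]
    # Shuffle so the gold label does not always sit in the same position.
    random.Random(seed).shuffle(choices)
    return choices


# Toy vectors standing in for CLIP text embeddings (made-up values).
label_embs = {
    "barn swallow": [0.9, 0.1, 0.0],
    "cliff swallow": [0.8, 0.2, 0.1],
    "tree swallow": [0.7, 0.3, 0.0],
    "golden retriever": [0.0, 0.1, 0.9],
    "school bus": [0.1, 0.9, 0.2],
}
image_emb = [0.85, 0.15, 0.05]  # stand-in for a CLIP image embedding
choices = mine_choices(image_emb, label_embs, gold="barn swallow", k=2)
```

With these toy values the two mined distractors are the other swallow species rather than the unrelated classes, which is the sense in which mined negatives keep the multiple-choice item difficult.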

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2406.14496 [cs.CV]
	(or arXiv:2406.14496v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2406.14496

Submission history

From: Gregor Geigle
[v1] Thu, 20 Jun 2024 16:59:39 UTC (8,851 KB)
