Babel-ImageNet: Massively Multilingual Evaluation of Vision-and-Language Representations

Geigle, Gregor; Timofte, Radu; Glavaš, Goran

arXiv:2306.08658 (cs)

[Submitted on 14 Jun 2023 (v1), last revised 12 Jun 2024 (this version, v2)]

Abstract: Vision-and-language (VL) models with separate encoders for each modality (e.g., CLIP) have become the go-to models for zero-shot image classification and image-text retrieval. They are, however, mostly evaluated in English as multilingual benchmarks are limited in availability. We introduce Babel-ImageNet, a massively multilingual benchmark that offers (partial) translations of ImageNet labels to 100 languages, built without machine translation or manual annotation. We instead automatically obtain reliable translations by linking them -- via shared WordNet synsets -- to BabelNet, a massively multilingual lexico-semantic network. We evaluate 11 public multilingual CLIP models on zero-shot image classification (ZS-IC) on our benchmark, demonstrating a significant gap between English ImageNet performance and that of high-resource languages (e.g., German or Chinese), and an even bigger gap for low-resource languages (e.g., Sinhala or Lao). Crucially, we show that the models' ZS-IC performance highly correlates with their performance in image-text retrieval, validating the use of Babel-ImageNet to evaluate multilingual models for the vast majority of languages without gold image-text data. Finally, we show that the performance of multilingual CLIP can be drastically improved for low-resource languages with parameter-efficient language-specific training. We make our code and data publicly available: this https URL
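To make the zero-shot image classification (ZS-IC) setup in the abstract concrete, here is a minimal sketch of how a dual-encoder model is typically scored on translated labels: each image is assigned the label whose text embedding is closest in cosine similarity, and top-1 accuracy is computed per language. It is not the authors' evaluation code. The function names, the three German labels, and all embedding vectors are made-up assumptions; in practice the vectors would come from a multilingual CLIP model's image and text encoders.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def zero_shot_classify(image_emb, label_embs):
    """Return (best_label, scores) using cosine similarity to each label embedding."""
    img = l2_normalize(image_emb)
    scores = {}
    for label, emb in label_embs.items():
        txt = l2_normalize(emb)
        scores[label] = sum(a * b for a, b in zip(img, txt))
    best = max(scores, key=scores.get)
    return best, scores

def top1_accuracy(image_embs, gold_labels, label_embs):
    """Fraction of images whose highest-scoring label matches the gold label."""
    correct = sum(
        zero_shot_classify(img, label_embs)[0] == gold
        for img, gold in zip(image_embs, gold_labels)
    )
    return correct / len(gold_labels)

# Hypothetical text embeddings for three German class labels (toy values).
label_embs_de = {
    "Hund": [0.9, 0.1, 0.0],
    "Katze": [0.1, 0.9, 0.0],
    "Auto": [0.0, 0.1, 0.9],
}

# Hypothetical image embeddings with their gold labels (toy values).
image_embs = [
    [0.8, 0.2, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.0, 1.0],
    [0.5, 0.6, 0.0],  # ambiguous image: closer to "Katze" than to its gold label
]
gold = ["Hund", "Katze", "Auto", "Hund"]

accuracy = top1_accuracy(image_embs, gold, label_embs_de)
print(f"top-1 accuracy: {accuracy:.2f}")
```

Running the same loop once per language, with that language's label embeddings, yields the kind of per-language accuracy the benchmark compares against English.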

Comments:	Accepted to ACL 2024
Subjects:	Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2306.08658 [cs.CL]
	(or arXiv:2306.08658v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2306.08658

Submission history

From: Gregor Geigle
[v1] Wed, 14 Jun 2023 17:53:06 UTC (1,791 KB)
[v2] Wed, 12 Jun 2024 09:33:29 UTC (1,944 KB)
