MOFI: Learning Image Representations from Noisy Entity Annotated Images

Wu, Wentao; Timofeev, Aleksei; Chen, Chen; Zhang, Bowen; Duan, Kun; Liu, Shuangning; Zheng, Yantao; Shlens, Jonathon; Du, Xianzhi; Gan, Zhe; Yang, Yinfei

arXiv:2306.07952 (cs)

[Submitted on 13 Jun 2023 (v1), last revised 17 Mar 2024 (this version, v3)]

Abstract: We present MOFI, Manifold OF Images, a new vision foundation model designed to learn image representations from noisy entity annotated images. MOFI differs from previous work in two key aspects: (i) pre-training data, and (ii) training recipe. Regarding data, we introduce a new approach to automatically assign entity labels to images from noisy image-text pairs. Our approach involves employing a named entity recognition model to extract entities from the alt-text, and then using a CLIP model to select the correct entities as labels of the paired image. It's a simple, cost-effective method that can scale to handle billions of web-mined image-text pairs. Through this method, we have created Image-to-Entities (I2E), a new dataset with 1 billion images and 2 million distinct entities, covering rich visual concepts in the wild. Building upon the I2E dataset, we study different training recipes like supervised pre-training, contrastive pre-training, and multi-task learning. For contrastive pre-training, we treat entity names as free-form text, and further enrich them with entity descriptions. Experiments show that supervised pre-training with large-scale fine-grained entity labels is highly effective for image retrieval tasks, and multi-task training further improves the performance. The final MOFI model achieves 86.66% mAP on the challenging GPR1200 dataset, surpassing the previous state-of-the-art performance of 72.19% from OpenAI's CLIP model. Further experiments on zero-shot and linear probe image classification also show that MOFI outperforms a CLIP model trained on the original image-text data, demonstrating the effectiveness of the I2E dataset in learning strong image representations. We release our code and model weights at this https URL.
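The entity-labeling step described in the abstract (extract candidate entities from the alt-text, then keep those that a CLIP-style model scores as matching the image) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' pipeline: the function names `cosine` and `select_entities`, the `threshold` and `top_k` parameters, and the hand-written 3-dimensional vectors standing in for CLIP image and text embeddings are all hypothetical. The named entity recognition step is assumed to have already produced the candidate names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_entities(image_emb, candidates, threshold=0.3, top_k=1):
    """Keep the candidate entities whose text embedding best matches the image.

    candidates maps an entity name (assumed to come from an NER pass over the
    alt-text) to its text embedding. threshold and top_k are illustrative
    knobs; the abstract does not state the actual selection rule.
    """
    scored = [(name, cosine(image_emb, emb)) for name, emb in candidates.items()]
    kept = [(name, score) for name, score in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in kept[:top_k]]


# Toy stand-ins for embeddings; a real system would obtain these from a CLIP model.
image_emb = [1.0, 0.0, 0.0]
candidates = {
    "Golden Gate Bridge": [0.9, 0.1, 0.0],
    "San Francisco": [0.5, 0.5, 0.5],
    "Tuesday": [0.0, 1.0, 0.0],
}

print(select_entities(image_emb, candidates))
print(select_entities(image_emb, candidates, top_k=3))
```

In this toy setup the unrelated candidate "Tuesday" falls below the threshold and is dropped, which mirrors the idea of filtering noisy alt-text entities before using them as image labels.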

Comments:	Accepted to ICLR 2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2306.07952 [cs.CV]
	(or arXiv:2306.07952v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2306.07952

Submission history

From: Zhe Gan [view email]
[v1] Tue, 13 Jun 2023 17:51:18 UTC (5,074 KB)
[v2] Sat, 24 Jun 2023 19:16:28 UTC (5,074 KB)
[v3] Sun, 17 Mar 2024 06:49:19 UTC (5,051 KB)
