A Survey of Vision-Language Pre-Trained Models

Du, Yifan; Liu, Zikang; Li, Junyi; Zhao, Wayne Xin

Computer Science > Computer Vision and Pattern Recognition

arXiv:2202.10936v1 (cs)

[Submitted on 18 Feb 2022 (this version), latest version 16 Jul 2022 (v2)]

Abstract: As the Transformer architecture has evolved, pre-trained models have advanced at a breakneck pace in recent years. They now dominate mainstream techniques in natural language processing (NLP) and computer vision (CV). How to adapt pre-training to Vision-and-Language (V-L) learning and improve performance on downstream tasks has become a focus of multimodal learning. In this paper, we review recent progress in Vision-Language Pre-Trained Models (VL-PTMs). As the core content, we first briefly introduce several ways to encode raw images and texts into single-modal embeddings before pre-training. Then, we dive into the mainstream architectures that VL-PTMs use to model the interaction between text and image representations. We further present widely used pre-training tasks, after which we introduce some common downstream tasks. We finally conclude the paper and present some promising research directions. Our survey aims to provide multimodal researchers with a synthesis of, and pointers to, related research.

Comments:	Under review
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2202.10936 [cs.CV]
	(or arXiv:2202.10936v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2202.10936

Submission history

From: Yifan Du
[v1] Fri, 18 Feb 2022 15:15:46 UTC (213 KB)
[v2] Sat, 16 Jul 2022 01:27:59 UTC (215 KB)
