UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training

Zhou, Mingyang; Zhou, Luowei; Wang, Shuohang; Cheng, Yu; Li, Linjie; Yu, Zhou; Liu, Jingjing

arXiv:2104.00332 (cs)

[Submitted on 1 Apr 2021]

Abstract: Vision-and-language pre-training has achieved impressive success in learning multimodal representations between vision and language. To generalize this success to non-English languages, we introduce UC2, the first machine translation-augmented framework for cross-lingual cross-modal representation learning. To tackle the scarcity problem of multilingual captions for image datasets, we first augment existing English-only datasets with other languages via machine translation (MT). Then we extend the standard Masked Language Modeling and Image-Text Matching training objectives to a multilingual setting, where alignment between different languages is captured through shared visual context (i.e., using the image as a pivot). To facilitate the learning of a joint embedding space of images and all languages of interest, we further propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM), leveraging MT-enhanced translated data. Evaluation on multilingual image-text retrieval and multilingual visual question answering benchmarks demonstrates that our proposed framework achieves a new state of the art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
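
The MT augmentation step mentioned in the abstract can be pictured with a minimal sketch. It is an illustration only, not the authors' implementation: `CaptionRecord`, `augment_with_mt`, and `toy_translate` are hypothetical names, the language codes are arbitrary, and the translator is a stub standing in for a real MT system. Each English image-caption pair is expanded so that one image carries captions in several languages, which is what would let the image act as the pivot between them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class CaptionRecord:
    """One image with its captions, keyed by language code (hypothetical type)."""
    image_id: str
    captions: Dict[str, str]


def augment_with_mt(
    english_pairs: List[Tuple[str, str]],
    target_langs: List[str],
    translate: Callable[[str, str], str],
) -> List[CaptionRecord]:
    """Attach machine-translated captions to each English image-caption pair."""
    records = []
    for image_id, caption in english_pairs:
        # Keep the original English caption alongside its translations.
        captions = {"en": caption}
        for lang in target_langs:
            captions[lang] = translate(caption, lang)
        records.append(CaptionRecord(image_id=image_id, captions=captions))
    return records


def toy_translate(text: str, lang: str) -> str:
    # Assumption: placeholder for a real MT system; it only tags the text.
    return f"[{lang}] {text}"


pairs = [("img_001", "a dog runs on the beach")]
records = augment_with_mt(pairs, ["de", "zh"], toy_translate)
print(records[0].captions)
```

Because every caption in a record refers to the same `image_id`, a downstream objective could pair the image with any one of the languages without needing direct text-to-text supervision.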

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2104.00332 [cs.CV]
	(or arXiv:2104.00332v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2104.00332

Submission history

From: Mingyang Zhou
[v1] Thu, 1 Apr 2021 08:30:53 UTC (2,403 KB)
