RetroMAE: Pre-training Retrieval-oriented Transformers via Masked Auto-Encoder

Liu, Zheng; Shao, Yingxia

arXiv:2205.12035v1 (cs)

[Submitted on 24 May 2022 (this version), latest version 17 Oct 2022 (v2)]

Abstract: Pre-trained models have demonstrated superior power on many important tasks. However, designing effective pre-training strategies that improve the models' usability for dense retrieval remains an open problem. In this paper, we propose a novel pre-training framework for dense retrieval based on the Masked Auto-Encoder, known as RetroMAE. The framework is distinguished by three critical designs: 1) an MAE-based pre-training workflow, in which the input sentence is polluted on both the encoder and decoder sides with different masks, and the original sentence is reconstructed from both the sentence embedding and the masked sentence; 2) asymmetric model architectures, with a large-scale expressive transformer for sentence encoding and an extremely simplified transformer for sentence reconstruction; 3) asymmetric masking ratios, with a moderate masking ratio on the encoder side (15%) and an aggressive masking ratio on the decoder side (50~90%). We pre-train a BERT-like encoder on English Wikipedia and BookCorpus, and it notably outperforms existing pre-trained models on a wide range of dense retrieval benchmarks, such as MS MARCO, Open-domain Question Answering, and BEIR.
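
To make the asymmetric masking idea concrete, here is a minimal sketch of how one sentence could be given two independently masked views, one for the encoder and one for the decoder. It is not the authors' implementation: the abstract only states the ratios (15% and 50~90%), so the uniform random choice of positions, the `[MASK]` placeholder, the 70% decoder ratio, and the helper names `mask_tokens` and `asymmetric_views` are all assumptions made for illustration.

```python
import random

MASK = "[MASK]"  # placeholder mask token (assumption, not taken from the paper)


def mask_tokens(tokens, ratio, rng):
    """Replace round(ratio * len(tokens)) uniformly chosen positions with MASK."""
    n_mask = round(ratio * len(tokens))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]


def asymmetric_views(tokens, enc_ratio=0.15, dec_ratio=0.7, seed=0):
    """Build two differently masked copies of the same sentence.

    enc_ratio follows the moderate encoder-side ratio in the abstract (15%);
    dec_ratio is one illustrative value from the stated 50~90% range.
    """
    rng = random.Random(seed)
    enc_input = mask_tokens(tokens, enc_ratio, rng)  # lightly masked encoder input
    dec_input = mask_tokens(tokens, dec_ratio, rng)  # heavily masked decoder input
    return enc_input, dec_input


tokens = [f"tok{i}" for i in range(20)]  # stand-in for a tokenized sentence
enc_input, dec_input = asymmetric_views(tokens)
print(enc_input.count(MASK), dec_input.count(MASK))
```

In the workflow the abstract describes, the encoder would turn the lightly masked view into a sentence embedding, and the simplified decoder would reconstruct the original sentence from that embedding together with the heavily masked view. The models themselves are omitted here.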

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2205.12035 [cs.CL]
	(or arXiv:2205.12035v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2205.12035

Submission history

From: Zheng Liu
[v1] Tue, 24 May 2022 12:43:04 UTC (188 KB)
[v2] Mon, 17 Oct 2022 14:08:37 UTC (358 KB)
