RetroMAE: Pre-Training Retrieval-oriented Language Models Via Masked Auto-Encoder

Xiao, Shitao; Liu, Zheng; Shao, Yingxia; Cao, Zhao

doi:10.48550/arxiv.2205.12035

Cited by 5 publications

(12 citation statements)

References 0 publications


“…Later on, the auto-encoding based pre-training algorithms receive growing interests: the input sentences are encoded into embeddings and reconstructed back to the original sentences (Lu et al, 2021). The recently proposed methods, such as SimLM and RetroMAE (Xiao et al, 2022a), extend the previous auto-encoding framework by upgrading the encoding and decoding mechanisms, which substantially improves the quality of deep semantic retrieval.…”

Section: Related Work

mentioning

confidence: 99%

“…Later on, auto-encoding is found to be more effective (Lu et al, 2021), where the language models are learned to reconstruct the input based on the generated embeddings. The recent work RetroMAE (Xiao et al, 2022a) extends the previous auto-encoding methods by introducing the enhanced encoding and decoding mechanisms, which leads to remarkable improvements on general retrieval benchmarks.…”

Section: Introduction

mentioning

confidence: 99%

“…The existing retrieval-oriented pre-trained models mainly rely on the contextualized embedding from the special token, i.e., [CLS], to represent the semantic about input (Gao and Callan, 2021; Lu et al, 2021; Xiao et al, 2022a). However, recent study finds that other ordinary tokens may provide extra information and help to generate better semantic representations (Lin et al, 2022).…”

Section: Introduction

mentioning

confidence: 99%

“…It introduces two decoding modules, which work together to enhance the semantic representation capacity for both types of contextualized embeddings. Particularly, we leverage the decoder from RetroMAE (Xiao et al, 2022a), where the [CLS] embedding, joined with the masked input, is used to recover the original sentence via a one-layer transformer. Meanwhile, the contextualized embeddings from ordinary tokens are transformed into the vocabulary space (i.e., |V|-dim vectors) via a linear projection unit (i.e., a d × |V| matrix).…”

Section: Introduction

mentioning

confidence: 99%
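
The linear projection described in the statement above (d-dimensional token embeddings mapped to |V|-dim vectors by a d × |V| matrix) can be illustrated with a minimal, dependency-free sketch. The function name `project_to_vocab`, the toy sizes, and the random values are assumptions made for illustration only; they are not taken from the cited papers.

```python
import random


def project_to_vocab(token_embeddings, projection):
    """Map each d-dim token embedding to a |V|-dim vector.

    token_embeddings: list of n vectors, each of length d
    projection:       d x |V| matrix given as a list of d rows
    (Hypothetical helper; the papers only state that a linear projection is used.)
    """
    d = len(projection)
    vocab_size = len(projection[0])
    projected = []
    for h in token_embeddings:
        if len(h) != d:
            raise ValueError("embedding size does not match projection rows")
        # Plain matrix-vector product: out[v] = sum_k h[k] * W[k][v]
        projected.append(
            [sum(h[k] * projection[k][v] for k in range(d)) for v in range(vocab_size)]
        )
    return projected


# Toy sizes, chosen only for illustration.
random.seed(0)
d, vocab_size, num_tokens = 4, 10, 3
H = [[random.gauss(0, 1) for _ in range(d)] for _ in range(num_tokens)]
W = [[random.gauss(0, 1) for _ in range(vocab_size)] for _ in range(d)]

vocab_vectors = project_to_vocab(H, W)
print(len(vocab_vectors), len(vocab_vectors[0]))
```

Each of the n token embeddings yields one |V|-dim vector, so the result has shape n × |V|.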


RetroMAE v2: Duplex Masked Auto-Encoder For Pre-Training Retrieval-Oriented Language Models

Xiao¹, Liu²

2022

Preprint


To better support retrieval applications such as web search and question answering, growing effort is being made to develop retrieval-oriented language models (Gao and Callan, 2021; Xiao et al., 2022a). Most existing works focus on improving the semantic representation capability of the contextualized embedding of the [CLS] token. However, a recent study shows that the ordinary tokens besides [CLS] may provide extra information, which helps to produce a better representation (Lin et al., 2022). It is therefore necessary to extend the current methods so that all contextualized embeddings can be jointly pre-trained for retrieval tasks. With this motivation, we propose a new pre-training method, the duplex masked auto-encoder (DupMAE), which aims to improve the semantic representation capacity of the contextualized embeddings of both [CLS] and ordinary tokens. It introduces two decoding tasks: one reconstructs the original input sentence from the [CLS] embedding; the other minimizes the bag-of-words (BoW) loss of the input sentence based on the embeddings of all ordinary tokens. The two decoding losses are added up to train a unified encoding model. The embeddings from [CLS] and ordinary tokens, after dimension reduction and aggregation, are concatenated into one unified semantic representation of the input. DupMAE is simple but empirically competitive: with a small decoding cost, it substantially improves the model's representation capability and transferability, and achieves remarkable improvements on the MS MARCO and BEIR benchmarks.
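
As a rough illustration of how the pieces named in the abstract fit together, the sketch below computes a bag-of-words loss from token-level vocabulary logits, adds it to a reconstruction loss, and concatenates the two embedding parts. The abstract does not specify the aggregation or the exact loss form, so max-pooling over tokens and a softmax negative log-likelihood averaged over the unique input tokens are assumptions here, and all function names are hypothetical.

```python
import math


def max_pool(vectors):
    """Element-wise max over token vectors (assumed aggregation)."""
    return [max(column) for column in zip(*vectors)]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def bow_loss(token_vocab_logits, input_token_ids):
    """Assumed BoW loss: mean negative log-probability of the unique input tokens."""
    log_probs = log_softmax(max_pool(token_vocab_logits))
    targets = set(input_token_ids)
    return -sum(log_probs[t] for t in targets) / len(targets)


def total_loss(reconstruction_loss, bag_of_words_loss):
    """The two decoding losses are simply added, as the abstract states."""
    return reconstruction_loss + bag_of_words_loss


def unified_representation(cls_reduced, tokens_aggregated):
    """Concatenate the (already reduced) [CLS] part and the aggregated token part."""
    return list(cls_reduced) + list(tokens_aggregated)


# Toy example: 2 tokens, vocabulary of size 3, input token ids 0 and 1.
logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
loss = bow_loss(logits, [0, 1])
print(round(loss, 4))
print(total_loss(1.5, loss) > 1.5)
print(unified_representation([0.1, 0.2], [0.3]))
```

The reconstruction loss is passed in as a number because the one-layer decoder that produces it is outside the scope of this sketch.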
