2022
DOI: 10.48550/arxiv.2204.08387
Preprint

LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking

Abstract: Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLM…
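The "unified text and image masking" idea in the abstract can be sketched in a few lines: both word-piece ids and discretized image-patch ids are masked with the same recipe, and the model is trained to predict the original discrete id at each masked position. The following is an illustrative sketch only, not the paper's actual implementation; the function name, mask probability, and toy ids are all assumptions.

```python
import random

def mask_tokens(tokens, mask_id, mask_prob=0.3, rng=None):
    """Replace a random subset of token ids with mask_id.

    Returns (masked_tokens, labels): labels[i] holds the original id
    at masked positions (the prediction target) and None elsewhere.
    The same routine applies to text tokens and image-patch tokens,
    which is the "unified masking" the abstract refers to.
    """
    rng = rng or random.Random(0)
    masked, labels = [], []
    for t in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_id)   # hide the token from the model
            labels.append(t)         # predict the original discrete id
        else:
            masked.append(t)
            labels.append(None)      # no loss at unmasked positions
    return masked, labels

# Text and image modalities share the same masking routine:
text_ids = [5, 9, 2, 7, 3]      # word-piece ids (illustrative)
patch_ids = [101, 88, 54, 60]   # ids from an image tokenizer (illustrative)
masked_text, text_labels = mask_tokens(text_ids, mask_id=0)
masked_patches, patch_labels = mask_tokens(patch_ids, mask_id=0)
```

In a real model the labels at masked positions would feed a cross-entropy loss over the text vocabulary and the image-token codebook respectively; the point here is only that one masking objective covers both modalities.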


Cited by 17 publications (27 citation statements)
References 24 publications
“…Therefore, the results reported demonstrate that our proposed approach outperforms all the methods that do not require any supplementary information such as layout information as used in [1,2,8,3,4]. Meanwhile, it achieves competitive results against the methods that include layout information in the pre-training setting.…”
Section: Results
confidence: 76%
“…The mechanisms used to leverage features from document modalities differ one from another. In [1,5,8,2], the authors propose a joint multimodal approach to model the interaction between textual, visual, and layout information in a unified multimodal pre-training network. Besides, [3] exploit cross-modal learning in the pre-training stage to perform a task-agnostic framework to model information across textual, visual, and layout information modalities without requiring document data annotation.…”
Section: Multimodal Document Pre-training
confidence: 99%
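The "unified multimodal pre-training network" described in the statement above amounts to feeding text, layout, and image information through one jointly self-attended sequence. A minimal sketch of that input construction, assuming toy 2-dimensional embeddings; the function name and the additive text-plus-layout combination are illustrative assumptions, not any specific model's API.

```python
def embed_document(word_embs, box_embs, patch_embs):
    """Build one joint input sequence for a unified multimodal Transformer.

    Each text embedding is summed with its 2D layout (bounding-box)
    embedding, then image-patch embeddings are appended as extra
    sequence positions, so self-attention spans all modalities.
    """
    assert len(word_embs) == len(box_embs)
    text_part = [
        [w + b for w, b in zip(word_vec, box_vec)]
        for word_vec, box_vec in zip(word_embs, box_embs)
    ]
    return text_part + patch_embs  # one sequence, jointly attended

words = [[0.1, 0.2], [0.3, 0.4]]    # toy text embeddings
boxes = [[0.01, 0.0], [0.0, 0.02]]  # toy layout embeddings
patches = [[0.5, 0.6]]              # toy image-patch embeddings
seq = embed_document(words, boxes, patches)
# seq has len(words) + len(patches) positions
```

The design choice the statement highlights is exactly this single shared sequence: interactions between textual, visual, and layout features are learned by one network rather than by separate per-modality encoders fused late.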
“…Document understanding has undoubtedly been an important research topic as documents play an essential role in message delivery in our daily lives. During the past several years, the flourishing blossom of deep learning has witnessed the rapid development of document understanding in various formats, ranging from plain texts (Devlin et al., 2018; Dong et al., 2019), document texts (Xu et al., 2021a; Huang et al., 2022), and web texts (Li et al., 2022a). Recently, pre-training techniques have been the de facto standard for document understanding, where the model is first pre-trained in a self-supervised manner (e.g.…”
Section: Introduction
confidence: 99%