LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking

Huang, Yupan; Lv, Tengchao; Cui, Lei; Lu, Yutong; Wei, Furu

doi:10.48550/arxiv.2204.08387

Cited by 17 publications

(27 citation statements)

References 24 publications

“…Therefore, the results reported demonstrate that our proposed approach outperforms all the methods that do not require any supplementary information such as layout information as used in [1,2,8,3,4]. Meanwhile, it achieves competitive results against the methods that include layout information in the pre-training setting.…”

Section: Results (mentioning)

confidence: 76%

“…The mechanisms used to leverage features from document modalities differ one from another. In [1,5,8,2], the authors propose a joint multimodal approach to model the interaction between textual, visual, and layout information in a unified multimodal pre-training network. Besides, [3] exploit cross-modal learning in the pre-training stage to perform a task-agnostic framework to model information across textual, visual, and layout information modalities without requiring document data annotation.…”

Section: Multimodal Document Pre-training (mentioning)

confidence: 99%

“…Specifically, attention learning has seen increased attention lately in the field of document understanding, image-text matching, and cross-modal retrieval, aiming at learning the internal relations in a text sentence or in an image. To model the internal relationships among different modalities, we adopt the contextualized attention mechanism from natural language processing (NLP) [20] to improve the location accuracy of a document image region in the vision modality for the desired text sequence in the language modality [8,3,4,5]. Our proposal highlights both the cross-modal co-attention (InterMCA), and internal self-attention (IntraMSA) mechanisms which are integrated in the proposed model, which means that self-attention and co-attention are integrated in the proposed model.…”

Section: Attention Mechanism (mentioning)

confidence: 99%

“…Therefore, recent research has started to consider how to leverage and incorporate the relations within those different modalities in a unified network to capture latent information for exploring better yet effective multimodal representations. Such systems have shown their effectiveness in improving multimodal representation learning in a pretrain-then-finetune paradigm, where models are first pre-trained with large-scale data and then fine-tuned to each downstream task [1,2,3,4,5,6,7,8].…”

Section: Introduction (mentioning)

confidence: 99%

VLCDoC: Vision-Language Contrastive Pre-Training Model for Cross-Modal Document Classification

Bakkali, Ming, Coustaty, et al. 2022

Preprint

“…Document understanding has undoubtedly been an important research topic as documents play an essential role in message delivery in our daily lives. During the past several years, the flourishing blossom of deep learning has witnessed the rapid development of document understanding in various formats, ranging from plain texts (Devlin et al, 2018; Dong et al, 2019), document texts (Xu et al, , 2021a; Huang et al, 2022), and web texts Li et al, 2022a;. Recently, pretraining techniques have been the de facto standard for document understanding, where the model is first pre-trained in a self-supervised manner (e.g.…”

Section: Introduction (mentioning)

confidence: 99%

XDoc: Unified Pre-training for Cross-Format Document Understanding

Chen, Lv, Cui, et al. 2022

Preprint

The surge of pre-training has witnessed the rapid development of document understanding recently. Pre-training and fine-tuning framework has been effectively used to tackle texts in various formats, including plain texts, document texts, and web texts. Despite achieving promising performance, existing pre-trained models usually target one specific document format at one time, making it difficult to combine knowledge from multiple document formats. To address this, we propose XDoc, a unified pre-trained model which deals with different document formats in a single model. For parameter efficiency, we share backbone parameters for different formats such as the word embedding layer and the Transformer layers. Meanwhile, we introduce adaptive layers with lightweight parameters to enhance the distinction across different formats. Experimental results have demonstrated that with only 36.7% parameters, XDoc achieves comparable or even better performance on a variety of downstream tasks compared with the individual pre-trained models, which is cost effective for real-world deployment. The code and pre-trained models will be publicly available at https://aka.ms/xdoc.

Unimodal and Multimodal Representation Training for Relation Extraction

Cooney, Rachel, Liam, et al. 2023

Communications in Computer and Information Science

Multimodal integration of text, layout and visual information has achieved SOTA results in visually rich document understanding (VrDU) tasks, including relation extraction (RE). However, despite its importance, evaluation of the relative predictive capacity of these modalities is less prevalent. Here, we demonstrate the value of shared representations for RE tasks by conducting experiments in which each data type is iteratively excluded during training. In addition, text and layout data are evaluated in isolation. While a bimodal text and layout approach performs best (F1 = 0.684), we show that text is the most important single predictor of entity relations. Additionally, layout geometry is highly predictive and may even be a feasible unimodal approach. Despite being less effective, we highlight circumstances where visual information can bolster performance. In total, our results demonstrate the efficacy of training joint representations for RE.
