Document AI: Benchmarks, Models and Applications

Cui, Lei; Xu, Yiheng; Lv, Tengchao; Wei, Furu

doi:10.48550/arxiv.2111.08609

Cited by 8 publications (10 citation statements)

References 105 publications (77 reference statements)

“…In recent years, pre-training techniques have been making waves in the Document AI community by achieving remarkable progress on document understanding tasks [2,12-14, 16, 25, 28, 29, 36, 37, 44, 45, 47-49]. As shown in Figure 1, a pre-trained Document AI model can parse layout and extract key information for various documents such as scanned forms and academic papers, which is important for industrial applications and academic research [7].…”

Section: Introduction (mentioning)

confidence: 99%

LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking

Huang, Lv, Cui et al. 2022

Preprint

Self Cite

Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pretrained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis. The code and models are publicly available at https://aka.ms/layoutlmv3.

“…It leads to an important research direction for both Computer Vision (CV) and Natural Language Processing (NLP), and is a fundamental task of Document AI, which aims to automatically read, understand, and analyze documents [1].…”

Section: Introduction (mentioning)

confidence: 99%

Transformer-Based Approach for Document Layout Understanding

Yang, Hsu 2022

2022 IEEE International Conference on Image Processing (ICIP)

We present an end-to-end transformer-based framework named TRDLU for the task of Document Layout Understanding (DLU). DLU is the fundamental task to automatically understand document structures. To accurately detect content boxes and classify them into semantically meaningful classes from various formats of documents is still an open challenge. Recently, transformer-based detection neural networks have shown their capability over traditional convolutional-based methods in the object detection area. In this paper, we consider DLU as a detection task, and introduce TRDLU which integrates transformer-based vision backbone and transformer encoder-decoder as detection pipeline. TRDLU is only a visual feature-based framework, but its performance is even better than multi-modal feature-based models. To the best of our knowledge, this is the first study of employing a fully transformer-based framework in DLU tasks. We evaluated TRDLU on three different DLU benchmark datasets, each with strong baselines. TRDLU outperforms the current state-of-the-art methods on all of them.

“…Visually-rich Document Understanding (VrDU) is a critical component of document intelligence [6] that aims to understand scanned or digital-born documents. Despite many advances in vision-language understanding, extracting structural information in visually-rich documents remains a major challenge because it involves different types of information, including image, text, and layout.…”

Section: Introduction (mentioning)

confidence: 99%

ERNIE-mmLayout: Multi-grained MultiModal Transformer for Document Understanding

Wang, Huang, Luo et al. 2022

Preprint

Recent efforts of multimodal Transformers have improved Visually-Rich Document Understanding (VrDU) tasks via incorporating visual and textual information. However, existing approaches mainly focus on fine-grained elements such as words and document image patches, making it hard for them to learn from coarse-grained elements, including natural lexical units like phrases and salient visual regions like prominent image regions. In this paper, we attach more importance to coarse-grained elements containing high-density information and consistent semantics, which are valuable for document understanding. At first, a document graph is proposed to model complex relationships among multi-grained multimodal elements, in which salient visual regions are detected by a cluster-based method. Then, a multi-grained multimodal Transformer called mmLayout is proposed to incorporate coarse-grained information into existing pre-trained fine-grained multimodal Transformers based on the graph. In mmLayout, coarse-grained information is aggregated from fine-grained, and then, after further processing, is fused back into fine-grained for final prediction. Furthermore, common sense enhancement is introduced to exploit the semantic information of natural lexical units. Experimental results on four tasks, including information extraction and document question answering, show that our method can improve the performance of multimodal Transformers based on fine-grained elements and achieve better performance with fewer parameters.
