LayoutLM: Pre-training of Text and Layout for Document Image Understanding

Xu, Yiheng; Li, Minghao; Cui, Lei; Huang, Shaohan; Wei, Furu; Zhou, Ming

doi:10.1145/3394486.3403172

Cited by 489 publications

(397 citation statements)

References 65 publications

Supporting

Mentioning

394

Contrasting

Unclassified

“…The second promising direction is the multimodal processing of the graphical objects. In the case of graphical page object detection, multimodal processing, in the simplest form, is the processing of image information and text information together [62,63]. An example of such a case is when a figure is categorized as a table and vice versa; the text information can be beneficial.…”

Section: Future Work (mentioning)

confidence: 99%

A Survey of Graphical Page Object Detection with Deep Neural Networks

et al. 2021

In any document, graphical elements like tables, figures, and formulas contain essential information. The processing and interpretation of such information require specialized algorithms. Off-the-shelf OCR components cannot process this information reliably. Therefore, an essential step in document analysis pipelines is to detect these graphical components. It leads to a high-level conceptual understanding of the documents that makes the digitization of documents viable. Since the advent of deep learning, deep learning-based object detection performance has improved manyfold. This work outlines and summarizes the deep learning approaches for detecting graphical page objects in document images. Therefore, we discuss the most relevant deep learning-based approaches and state-of-the-art graphical page object detection in document images. This work provides a comprehensive understanding of the current state-of-the-art and related challenges. Furthermore, we discuss leading datasets along with the quantitative evaluation. Moreover, it briefly discusses the promising directions that can be utilized for further improvements.

“…Extracting pre-defined and commonly occurring named entities from invoices like documents (using text and box coordinates) has been the main focus for some prior works (Katti et al, 2018; Liu et al, 2019; Denk and Reisswig, 2019; Majumder et al, 2020). Text and document layouts have been used for learning BERT (Devlin et al, 2019) like representations through pre-training and then combined with image features for information extraction from documents (Xu et al, 2020; Garncarek et al, 2020). However, our work focuses on extracting a much more generic, diverse, complex, dense, and hierarchical document structure from Forms.…”

Section: Related Work (mentioning)

confidence: 99%

Form2Seq : A Framework for Higher-Order Form Structure Extraction

Aggarwal, Gupta, Sarkar et al. 2020

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Document structure extraction has been a widely researched area for decades with recent works performing it as a semantic segmentation task over document images using fully convolutional networks. Such methods are limited by image resolution due to which they fail to disambiguate structures in dense regions which appear commonly in forms. To mitigate this, we propose Form2Seq, a novel sequence-to-sequence (Seq2Seq) inspired framework for structure extraction using text, with a specific focus on forms, which leverages relative spatial arrangement of structures. We discuss two tasks: 1) Classification of low-level constituent elements (TextBlock and empty fillable Widget) into ten types such as field captions, list items, and others; 2) Grouping lower-level elements into higher-order constructs, such as Text Fields, ChoiceFields and ChoiceGroups, used as information collection mechanism in forms. To achieve this, we arrange the constituent elements linearly in natural reading order, feed their spatial and textual representations to Seq2Seq framework, which sequentially outputs prediction of each element depending on the final task. We modify Seq2Seq for grouping task and discuss improvements obtained through cascaded end-to-end training of two tasks versus training in isolation. Experimental results show the effectiveness of our text-based approach achieving an accuracy of 90% on classification task and an F1 of 75.82, 86.01, 61.63 on groups discussed above respectively, outperforming segmentation baselines. Further, we show our framework achieves state-of-the-art results for table structure recognition on ICDAR 2013 dataset.
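The abstract above says the constituent elements are arranged linearly in natural reading order before being fed to the Seq2Seq model, but it does not say how that order is computed. The following is a minimal sketch of one common heuristic (top-to-bottom by line, then left-to-right within a line), not the Form2Seq authors' procedure. The `Element` class, the `reading_order` function, and the `line_tolerance` parameter are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """A form element with the top-left corner of its bounding box (hypothetical type)."""
    text: str
    x: float
    y: float


def reading_order(elements, line_tolerance=5.0):
    """Order elements top-to-bottom, then left-to-right within each line.

    Elements whose y coordinate is within `line_tolerance` of the first
    element of the current line are treated as being on the same line.
    This is an assumed heuristic, not the method from the paper.
    """
    by_position = sorted(elements, key=lambda e: (e.y, e.x))
    lines = []
    for element in by_position:
        if lines and abs(element.y - lines[-1][0].y) <= line_tolerance:
            lines[-1].append(element)
        else:
            lines.append([element])
    return [e for line in lines for e in sorted(line, key=lambda e: e.x)]


elements = [
    Element("date widget", 100, 38),
    Element("Name:", 10, 10),
    Element("Date:", 10, 40),
    Element("name widget", 100, 12),
]
ordered_texts = [e.text for e in reading_order(elements)]
```

The tolerance matters because a caption and its fillable widget rarely share exactly the same y coordinate; without it, a widget sitting slightly higher than its caption would be emitted first.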

“…LayoutLM (Xu et al, 2019) is a BERT-like transformer model modified to generate layout-aware contextualized word embeddings. In place of BERT's single positional embedding, LayoutLM adds positional embeddings for the x- and y-coordinates of a bounding box around the token.…”

Section: Systems (mentioning)

confidence: 99%

“…We therefore expect that a hybrid document representation that combines layout and text information should outperform a text-only representation when clustering documents by type. LayoutLM (Xu et al, 2019) is such a hybrid system and achieves state-of-the-art performance for document-type classification, outperforming text-only baselines. We therefore hypothesized that LayoutLM would also outperform these baselines for document-type clustering.…”

Section: Introduction (mentioning)

confidence: 99%

Layout-Aware Text Representations Harm Clustering Documents by Type

Finegan-Dollak, Verma 2020

Proceedings of the First Workshop on Insights From Negative Results in NLP

Clustering documents by type (grouping invoices with invoices and articles with articles) is a desirable first step for organizing large collections of document scans. Humans approaching this task use both the semantics of the text and the document layout to assist in grouping like documents. LayoutLM (Xu et al., 2019), a layout-aware transformer built on top of BERT with state-of-the-art performance on document-type classification, could reasonably be expected to outperform regular BERT (Devlin et al., 2018) for document-type clustering. However, we find experimentally that BERT significantly outperforms LayoutLM on this task (p < 0.001). We analyze clusters to show where layout awareness is an asset and where it is a liability.
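The citing statements above describe LayoutLM as adding positional embeddings for the x- and y-coordinates of each token's bounding box. The sketch below illustrates that idea with plain Python lists: a token embedding is summed with embeddings looked up for the box corners. It is a toy illustration under stated assumptions, not the LayoutLM implementation. The table sizes, the hidden size, the random initialization, the normalization of coordinates to integers in 0 to 1000, and the function names `normalize_box` and `layout_aware_embedding` are all assumptions made for this example.

```python
import random

HIDDEN = 8        # illustrative hidden size (assumption)
VOCAB = 100       # toy vocabulary size (assumption)
MAX_COORD = 1001  # coordinates assumed normalized to integers in [0, 1000]


def make_table(rows, dim, rng):
    """Build a randomly initialized embedding table as a list of vectors."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]


rng = random.Random(0)
token_table = make_table(VOCAB, HIDDEN, rng)
x_table = make_table(MAX_COORD, HIDDEN, rng)  # used for both x0 and x1
y_table = make_table(MAX_COORD, HIDDEN, rng)  # used for both y0 and y1


def normalize_box(box, page_width, page_height):
    """Scale a pixel box (x0, y0, x1, y1) to integer coordinates in [0, 1000]."""
    x0, y0, x1, y1 = box
    return (
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    )


def layout_aware_embedding(token_id, box):
    """Sum the token embedding with x and y embeddings of the box corners."""
    x0, y0, x1, y1 = box
    rows = [
        token_table[token_id],
        x_table[x0],
        y_table[y0],
        x_table[x1],
        y_table[y1],
    ]
    return [sum(values) for values in zip(*rows)]


box = normalize_box((50, 100, 150, 200), page_width=500, page_height=1000)
top_left = layout_aware_embedding(5, box)
elsewhere = layout_aware_embedding(5, (600, 700, 800, 750))
```

Because the same token receives a different vector at a different page position, downstream layers can tell apart, for example, a number in a header from the same number in a table cell.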
