Robust Layout-aware IE for Visually Rich Documents with Pre-trained Language Models

Wei, Mengxi; He, Yifan; Zhang, Qiong

doi:10.1145/3397271.3401442

Cited by 39 publications

(10 citation statements)

References 17 publications


“…In this section, we introduce the datasets for pretraining and evaluation. Two datasets, RVL-CDIP [38] and DocBank [21] are utilized for pre-training. We conduct evaluations on three downstream tasks: 1) Reading order detection task on ReadingBank [37], 2) Table structure recognition task on SciTSR [6], ICDAR-2013 [11] and ICDAR-2019 [10], in which we follow the Setup-B setting in [24] where input by image along with layouts and contents, 3) Key information extraction task on FUNSD [18] and CORD [29], in which we focus on their entity linking tasks that rely on analyzing pairwise relation between entities.…”

Section: Datasets and Evaluation Protocol (mentioning)

confidence: 99%

Relational Representation Learning in Visually-Rich Documents

Li, Zheng, Hu et al. 2022

Preprint


Relational understanding is critical for a number of visually-rich documents (VRDs) understanding tasks. Through multi-modal pre-training, recent studies provide comprehensive contextual representations and exploit them as prior knowledge for downstream tasks. In spite of their impressive results, we observe that the widespread relational hints (e.g., relation of key/value fields on receipts) built upon contextual knowledge are not excavated yet. To mitigate this gap, we propose DocReL, a Document Relational Representation Learning framework. The major challenge of DocReL roots in the variety of relations. From the simplest pairwise relation to the complex global structure, it is infeasible to conduct supervised training due to the definition of relation varies and even conflicts in different tasks. To deal with the unpredictable definition of relations, we propose a novel contrastive learning task named Relational Consistency Modeling (RCM), which harnesses the fact that existing relations should be consistent in differently augmented positive views. RCM provides relational representations which are more compatible to the urgent need of downstream tasks, even without any knowledge about the exact definition of relation. DocReL achieves better performance on a wide variety of VRD relational understanding tasks, including table structure recognition, key information extraction and reading order detection.


“…However, traditional NER models organize text in one dimension depending on the reading order and are unsuitable for VRDs with complex layouts. Recent studies [29,38,42,45,46,48,50] have realized the significance of segment-level features and incorporate a segment embedding to attach extra higher semantics. Although those methods, such as PICK [48] and TRIE [50], construct contextual features involving the segment clues, they revert to token-level labeling with NER-based schemes.…”

Section: Related Work (mentioning)

confidence: 99%

StrucTexT: Structured Text Understanding with Multi-Modal Transformers

Qian et al. 2021

Proceedings of the 29th ACM International Conference on Multimedia


Structured text understanding on Visually Rich Documents (VRDs) is a crucial part of Document Intelligence. Due to the complexity of content and layout in VRDs, structured text understanding has been a challenging task. Most existing studies decoupled this problem into two sub-tasks: entity labeling and entity linking, which require an entire understanding of the context of documents at both token and segment levels. However, little work has been concerned with the solutions that efficiently extract the structured data from different levels. This paper proposes a unified framework named StrucTexT, which is flexible and effective for handling both sub-tasks. Specifically, based on the transformer, we introduce a segment-token aligned encoder to deal with the entity labeling and entity linking tasks at different levels of granularity. Moreover, we design a novel pre-training strategy with three self-supervised tasks to learn a richer representation. StrucTexT uses the existing Masked Visual Language Modeling task and the new Sentence Length Prediction and Paired Boxes Direction tasks to incorporate the multi-modal information across text, image, and layout. We evaluate our method for structured text understanding at segment-level and token-level and show it outperforms the state-of-the-art counterparts with significantly superior performance on the FUNSD, SROIE, and EPHOIE datasets.


“…While the image modality was introduced only at the finetuning stage in LayoutLM, later models [28,14,35] include visual descriptors from convolutional layers directly into the token representations used for pre-training. These recent works mainly focus on adding new pre-training objectives complementing MVLM to more effectively mix the text, layout and image modalities when learning the document representations, for example the topic-modeling and document shuffling tasks of [28], the Sequence Positional Relationship Classification (SPRC) objective [34], the text-image alignment and matching tasks leveraged in [35] and the 2D area-masking strategy from [14]. Moreover, [35,14] both modify the computation of the self-attention scores to better encompass the relative positional relationships among the tokens of the document.…”

Section: Related Work on Information Extraction (IE) (mentioning)

confidence: 99%

Data-Efficient Information Extraction from Documents with Pre-trained Language Models

Sage, Douzon, Aussem et al. 2021

Document Analysis and Recognition – ICDAR 2021 Workshops

