Proceedings of the 29th ACM International Conference on Multimedia 2021
DOI: 10.1145/3474085.3475345

StrucTexT: Structured Text Understanding with Multi-Modal Transformers

Abstract: Structured text understanding on Visually Rich Documents (VRDs) is a crucial part of Document Intelligence. Due to the complexity of content and layout in VRDs, structured text understanding has been a challenging task. Most existing studies decoupled this problem into two sub-tasks: entity labeling and entity linking, which require an entire understanding of the context of documents at both token and segment levels. However, little work has been concerned with the solutions that efficiently extract the struct…

Cited by 81 publications (54 citation statements)
References 45 publications
“…SelfDoc (Li et al., 2021b) established the contextualization over a block of content, while StructuralLM (Li et al., 2021a) proposed cell-level 2D position embeddings and the corresponding pre-training objective. Recently, StrucTexT (Li et al., 2021c) introduced a unified solution to efficiently extract semantic features from different levels and modalities to handle the entity labeling and entity linking tasks. DocFormer (Appalaraju et al., 2021) designed a novel multi-modal self-attention layer capable of fusing textual, vision and spatial features.…”
Section: Related Work
confidence: 99%
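A rough illustration of the layout-position embeddings mentioned in the excerpt above (e.g., StructuralLM's cell-level 2D position embeddings): the PyTorch sketch below adds bounding-box coordinate embeddings to token embeddings. The module name, dimensions, and integer coordinate normalization are assumptions made for illustration, not the cited models' actual code.

    import torch
    import torch.nn as nn

    class LayoutEmbedding(nn.Module):
        """Hypothetical sketch: token embeddings enriched with 2D box coordinates."""
        def __init__(self, vocab_size=30522, hidden=768, max_coord=1000):
            super().__init__()
            self.tok = nn.Embedding(vocab_size, hidden)
            # Separate lookup tables for the x and y coordinates of each bounding box.
            self.x_emb = nn.Embedding(max_coord, hidden)
            self.y_emb = nn.Embedding(max_coord, hidden)

        def forward(self, token_ids, boxes):
            # token_ids: (batch, seq) long; boxes: (batch, seq, 4) long with
            # (x0, y0, x1, y1) already normalized to integers in [0, max_coord).
            x0, y0, x1, y1 = boxes.unbind(-1)
            layout = self.x_emb(x0) + self.y_emb(y0) + self.x_emb(x1) + self.y_emb(y1)
            return self.tok(token_ids) + layout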
“…Most previous pre-training models [10], [35] produce an embedding sequence by collecting multimodal information from text, vision, and layout, and then perform a transformer network to establish deep fusion on different modalities. In our work, we adopt semantically meaningful components (e.g., text block, table, figure) as the model input.…”
Section: Gate Fusion Layer
confidence: 99%
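The gate fusion named in the section label above can be sketched as a learned sigmoid gate that blends features from two modalities; this is a minimal, hypothetical module under that assumption, not the paper's exact layer.

    import torch
    import torch.nn as nn

    class GateFusion(nn.Module):
        """Hypothetical sketch: gated blend of text and visual features."""
        def __init__(self, hidden=768):
            super().__init__()
            self.gate = nn.Linear(2 * hidden, hidden)

        def forward(self, text_feat, visual_feat):
            # Both inputs: (batch, seq, hidden). The gate decides, per dimension,
            # how much of each modality to keep.
            g = torch.sigmoid(self.gate(torch.cat([text_feat, visual_feat], dim=-1)))
            return g * text_feat + (1.0 - g) * visual_feat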
“…These entity fields are extracted either from preconditioned OCR results or parsing results from electronic formats like PDF or Microsoft Word. To keep a fixed input format, these entities are sorted beforehand according to the top-left to bottom-right order [22].…”
Section: Approach
confidence: 99%
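A minimal sketch of the top-left to bottom-right ordering described above: group OCR entity boxes into rows by vertical position, then sort each row left to right. The function and field names here are hypothetical.

    def sort_reading_order(entities, row_tolerance=10):
        """entities: list of dicts with a 'box' key holding (x0, y0, x1, y1)."""
        # Rough top-to-bottom pass first.
        entities = sorted(entities, key=lambda e: e["box"][1])
        rows, current = [], []
        for ent in entities:
            # Start a new row when the vertical gap to the previous box is large.
            if current and abs(ent["box"][1] - current[-1]["box"][1]) > row_tolerance:
                rows.append(current)
                current = []
            current.append(ent)
        if current:
            rows.append(current)
        # Within each row, read left to right.
        return [e for row in rows for e in sorted(row, key=lambda e: e["box"][0])]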
“…Owing to their huge potential, VRDs understanding has attracted increasing attention in the multimedia community. Recent studies [16,19,22,41,42] usually follow a two-step pre-training then fine-tuning regime. With cutting-edge model designs, the pre-training step jointly models a multi-modal interaction between text, layout, and image, then produces comprehensive representations for fine-tuning specific downstream tasks.…”
Section: Introduction
confidence: 99%
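The fine-tuning half of the two-step regime described above usually amounts to training a lightweight task head on top of the pre-trained multi-modal encoder. The sketch below assumes an encoder that returns per-token contextual features; it is not tied to any particular cited model.

    import torch.nn as nn

    class EntityLabelingModel(nn.Module):
        """Hypothetical fine-tuning wrapper: pre-trained encoder + task head."""
        def __init__(self, encoder: nn.Module, hidden=768, num_labels=5):
            super().__init__()
            self.encoder = encoder                      # pre-trained backbone
            self.head = nn.Linear(hidden, num_labels)   # trained on the downstream task

        def forward(self, **inputs):
            feats = self.encoder(**inputs)   # assumed shape (batch, seq, hidden)
            return self.head(feats)          # per-token entity logits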