2022
DOI: 10.48550/arxiv.2209.08569
Preprint

ERNIE-mmLayout: Multi-grained MultiModal Transformer for Document Understanding

Abstract: Recent efforts of multimodal Transformers have improved Visually-Rich Document Understanding (VrDU) tasks via incorporating visual and textual information. However, existing approaches mainly focus on fine-grained elements such as words and document image patches, making it hard for them to learn from coarse-grained elements, including natural lexical units like phrases and salient visual regions like prominent image regions. In this paper, we attach more importance to coarse-grained elements containing high-de…

Cited by 1 publication
(1 citation statement)
References 35 publications
“…Deep learning networks have made extraordinary progress in artificial intelligence [1]-[4]. Multigranularity structure is an important feature of deep networks, whether it is image, text, or speech data, where feature representations can be extracted at different granularities [5], [6]. Deep networks extract finer-grained features by processing data from low to high and building multi-level, multi-granular semantics [7]-[9].…”
Section: Introduction (mentioning)
confidence: 99%