2021
DOI: 10.48550/arxiv.2106.03331
Preprint

SelfDoc: Self-Supervised Document Representation Learning

Abstract: We propose SelfDoc, a task-agnostic pre-training framework for document image understanding. Because documents are multimodal and are intended for sequential reading, our framework exploits the positional, textual, and visual information of every semantically meaningful component in a document, and it models the contextualization between each block of content. Unlike existing document pre-training models, our model is coarse-grained instead of treating individual words as input, therefore avoiding an overly fine-grained…
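To make the abstract's block-level (coarse-grained) idea concrete, here is a minimal sketch of how positional, textual, and visual features of each document block could be fused and contextualized. It is not the authors' implementation: the tensor dimensions, the sum-based fusion, the mean-pooling of tokens per block, and all module names are assumptions chosen only for illustration.

# Minimal sketch (assumed design, not SelfDoc's actual architecture): each block
# contributes a pooled text embedding, a visual embedding, and a 2D-position
# embedding; the fused block vectors are contextualized by a Transformer encoder.
import torch
import torch.nn as nn

class BlockEncoder(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=2, vocab_size=30522):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, d_model)   # token ids -> embeddings, pooled per block
        self.visual_proj = nn.Linear(2048, d_model)         # per-block visual feature (assumed 2048-d) -> d_model
        self.pos_proj = nn.Linear(4, d_model)                # normalized (x0, y0, x1, y1) block box -> d_model
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, token_ids, visual_feats, boxes):
        # token_ids:    (batch, blocks, tokens_per_block) word ids inside each block
        # visual_feats: (batch, blocks, 2048)              one visual feature per block
        # boxes:        (batch, blocks, 4)                  normalized block coordinates
        text = self.text_emb(token_ids).mean(dim=2)          # pool tokens -> one vector per block
        fused = text + self.visual_proj(visual_feats) + self.pos_proj(boxes)
        return self.encoder(fused)                            # contextualized block representations

# Toy usage: 2 documents, 5 blocks each, 8 tokens per block.
ids = torch.randint(0, 30522, (2, 5, 8))
vis = torch.randn(2, 5, 2048)
box = torch.rand(2, 5, 4)
print(BlockEncoder()(ids, vis, box).shape)  # torch.Size([2, 5, 256])

The point of the sketch is that the sequence fed to the encoder has one element per semantically meaningful block rather than one per word, which is the coarse-grained contextualization the abstract contrasts with word-level pre-training models.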

Cited by 1 publication (4 citation statements)
References 25 publications
“…Therefore, the results reported demonstrate that our proposed approach outperforms all the methods that do not require any supplementary information such as layout information as used in [1,2,8,3,4]. Meanwhile, it achieves competitive results against the methods that include layout information in the pre-training setting.…”
Section: Results (mentioning)
confidence: 76%
“…In [1,5,8,2], the authors propose a joint multimodal approach to model the interaction between textual, visual, and layout information in a unified multimodal pre-training network. Besides, [3] exploit cross-modal learning in the pre-training stage to perform a task-agnostic framework to model information across textual, visual, and layout information modalities without requiring document data annotation. In [4], the authors encourage multimodal interaction using a multimodal transformer architecture to perform visual document understanding.…”
Section: Multimodal Document Pre-training (mentioning)
confidence: 99%