2020
DOI: 10.48550/arxiv.2009.14457
Preprint

Towards a Multi-modal, Multi-task Learning based Pre-training Framework for Document Representation Learning

Abstract: In this paper, we propose a multi-task learning-based framework that utilizes a combination of self-supervised and supervised pre-training tasks to learn a generic document representation. We design the network architecture and the pre-training tasks to incorporate the multi-modal document information across text, layout, and image dimensions and allow the network to work with multi-page documents. We showcase the applicability of our pre-training framework on a variety of different real-world document tasks su…

Cited by 3 publications (9 citation statements). References 19 publications.
“…These include two standard convolutional neural networks VGG-16 and ResNet-50, two multimodal ensemble approaches [13,8] using VGG-16 and a neural network for text encoding, plus the Sentence-BERT embedding. We have four task-agnostic learning methods, including two pre-trained language models [9,19], the approach proposed by Pramanik et al [25] pre-trained on arXiv dataset [4], and LayoutLM [36] pre-trained on IIT-CDIP dataset [17]. SelfDoc outperforms baselines.…”
Section: Results
Citation type: mentioning (confidence: 99%)
“…Document Pre-training. Most recently, some works have started pre-training models on document images [36,25]. The first one, LayoutLM [36], inherits the main idea from BERT while receiving the extra positional information for text in documents, and additionally includes image embeddings in the fine-tuning phase.…”
Section: Related Work
Citation type: mentioning (confidence: 99%)
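The quoted description of LayoutLM (a BERT-style text encoder that additionally receives positional information for each word on the page) can be illustrated with a short sketch. This is only an illustration of the general idea stated in the citation above, not the authors' implementation; the class name, embedding sizes, and coordinate quantization below are assumptions.

```python
# Hedged sketch only: illustrates "BERT plus extra positional information for
# text in documents" as token embeddings summed with 2-D layout embeddings
# computed from each word's quantized bounding box. Names and sizes are
# illustrative assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn

class LayoutAwareEmbedding(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, max_seq=512, max_coord=1024):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)   # word-piece embeddings
        self.pos = nn.Embedding(max_seq, hidden)      # 1-D sequence position
        self.x_emb = nn.Embedding(max_coord, hidden)  # quantized x-coordinates (x0, x1)
        self.y_emb = nn.Embedding(max_coord, hidden)  # quantized y-coordinates (y0, y1)

    def forward(self, token_ids, boxes):
        # token_ids: (batch, seq) LongTensor
        # boxes:     (batch, seq, 4) LongTensor of (x0, y0, x1, y1) in [0, max_coord)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        emb = self.tok(token_ids) + self.pos(positions)
        emb = emb + self.x_emb(boxes[..., 0]) + self.y_emb(boxes[..., 1])
        emb = emb + self.x_emb(boxes[..., 2]) + self.y_emb(boxes[..., 3])
        return emb  # would feed a BERT-style Transformer encoder downstream
```

The resulting embeddings would be consumed by a standard Transformer encoder, with image features added later during fine-tuning, as the quotation notes.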
“…Recently, pre-training models [7,22,35,36] show a strong feature representation using large-scale unlabeled training samples. Inspired by this, several works [28,45,46] combine pretraining techniques to improve multi-modal features. Pramanik et al [28] introduces a multi-task learning-based framework to yield a generic document representation.…”
Section: Related Work
Citation type: mentioning (confidence: 99%)
“…Besides, for further improvement, mainstream researches [2,21,24,30,38,48,50] usually employ a shallow fusion of text, image, and layout to capture contextual dependencies. Recently, several pre-training models [28,45,46] have been proposed for joint learning the deep fusion of cross-modality on large-scale data and outperform counterparts on document understanding. Although these pre-training models consider all modalities of documents, they focus on the contribution related to the text side with less elaborate visual features.…”
Section: Introduction
Citation type: mentioning (confidence: 99%)