“…In [1,5,8,2], the authors propose joint approaches that model the interaction between textual, visual, and layout information within a unified multimodal pre-training network. In addition, [3] exploits cross-modal learning during the pre-training stage to build a task-agnostic framework that models textual, visual, and layout information without requiring annotated document data. In [4], the authors encourage multimodal interaction through a multimodal transformer architecture for visual document understanding.…”