DocVQA: A Dataset for VQA on Document Images

Mathew, Minesh; Karatzas, Dimosthenis; Jawahar, C. V.

doi:10.1109/wacv48630.2021.00225

Cited by 180 publications

(101 citation statements)

References 19 publications


“…Dataset. We use the PubLayNet dataset [38] and DocVQA dataset [23] to train the document object detector. PubLayNet includes 340K scholarly articles with bounding box on text block, heading, figure, list, and table, and DocVQA has 12K forms with a bounding box annotated for each text block.…”

Section: Methods

mentioning

confidence: 99%

“…To begin with, we train a document object detector using Faster R-CNN [28] on public document datasets [38,23] with bounding box annotations on semantically meaningful components, and localize significant components (i.e., document object proposals) of a document. In our current implementation, we detect the following categories: text block, title, list, table, and figure.…”

Section: Pre-processing and Feature Extraction

mentioning

confidence: 99%


SelfDoc: Self-Supervised Document Representation Learning

Li, Gu, Kuen et al. 2021

Preprint

We propose SelfDoc, a task-agnostic pre-training framework for document image understanding. Because documents are multimodal and are intended for sequential reading, our framework exploits the positional, textual, and visual information of every semantically meaningful component in a document, and it models the contextualization between each block of content. Unlike existing document pre-training models, our model is coarse-grained instead of treating individual words as input, therefore avoiding an overly fine-grained with excessive contextualization. Beyond that, we introduce cross-modal learning in the model pre-training phase to fully leverage multimodal information from unlabeled documents. For downstream usage, we propose a novel modality-adaptive attention mechanism for multimodal feature fusion by adaptively emphasizing language and vision signals. Our framework benefits from self-supervised pre-training on documents without requiring annotations by a feature masking training strategy. It achieves superior performance on multiple downstream tasks with significantly fewer document images used in the pre-training stage compared to previous works.


“…with Spatial-Aware Self-Attention Mechanism et al, 2015) for document image classification, as well as the DocVQA dataset (Mathew et al, 2020) for visual question answering on document images. Experiment results show that the LayoutLMv2 model outperforms strong baselines including the vanilla LayoutLM and achieves new state-of-the-art results in these downstream VrDU tasks, which substantially benefits a great number of real-world document understanding tasks.…”

Section: Transformer Layers

mentioning

confidence: 99%

“…The evaluation metric is the overall classification accuracy. Text and layout information is extracted by Microsoft OCR. DocVQA: As a VQA dataset on the document understanding field, DocVQA (Mathew et al, 2020) consists of 50,000 questions defined on over 12,000 pages from a variety of documents. Pages are split into the training set, validation set and test set with a ratio of about 8:1:1.…”

mentioning

confidence: 99%

LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding

et al. 2020

Preprint


Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents. In this paper, we present LayoutLMv2 by pre-training text, layout and image in a multi-modal framework, where new model architectures and pre-training tasks are leveraged. Specifically, LayoutLMv2 not only uses the existing masked visual-language modeling task but also the new text-image alignment and text-image matching tasks in the pre-training stage, where cross-modality interaction is better learned. Meanwhile, it also integrates a spatial-aware self-attention mechanism into the Transformer architecture, so that the model can fully understand the relative positional relationship among different text blocks. Experiment results show that LayoutLMv2 outperforms strong baselines and achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks, including FUNSD (0.7895 → 0.8420), CORD (0.9493 → 0.9601), SROIE (0.9524 → 0.9781), Kleister-NDA (0.834 → 0.852), RVL-CDIP (0.9443 → 0.9564), and DocVQA (0.7295 → 0.8672).


“…Similarly, ST-VQA [4] contains 32k questions on images from 6 different sources (IC13 [26], IC15 [25], ImageNet [10], VizWiz [3], IIIT Scene Text Retrieval, Visual Genome [29], and COCO-Text [59]). A series of datasets were introduced following these which focused on specific aspects of text-based VQA including OCR-VQA [41], STE-VQA [62], DocVQA [38], PlotQA [39], and LEAF-QA [7].…”

Section: Downstream OCR Applications

mentioning

confidence: 99%

TextOCR: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text

Singh, Pang, Toh et al. 2021

Preprint

A crucial component for the scene text based reasoning required for TextVQA and TextCaps datasets involve detecting and recognizing text present in the images using an optical character recognition (OCR) system. The current systems are crippled by the unavailability of ground truth text annotations for these datasets as well as lack of scene text detection and recognition datasets on real images disallowing the progress in the field of OCR and evaluation of scene text based reasoning in isolation from OCR systems. In this work, we propose TextOCR, an arbitrary-shaped scene text detection and recognition with 900k annotated words collected on real images from TextVQA dataset. We show that current state-of-the-art text-recognition (OCR) models fail to perform well on TextOCR and that training on TextOCR helps achieve state-of-the-art performance on multiple other OCR datasets as well. We use a TextOCR trained OCR model to create PixelM4C model which can do scene text based reasoning on an image in an end-to-end fashion, allowing us to revisit several design choices to achieve new state-of-the-art performance on TextVQA dataset.
