2021 IEEE Winter Conference on Applications of Computer Vision (WACV)
DOI: 10.1109/wacv48630.2021.00225

DocVQA: A Dataset for VQA on Document Images

Cited by 180 publications (101 citation statements) · References 19 publications
“…Dataset. We use the PubLayNet dataset [38] and DocVQA dataset [23] to train the document object detector. PubLayNet includes 340K scholarly articles with bounding boxes on text blocks, headings, figures, lists, and tables, and DocVQA has 12K forms with a bounding box annotated for each text block.…”
Section: Methods (mentioning)
Confidence: 99%
“…To begin with, we train a document object detector using Faster R-CNN [28] on public document datasets [38, 23] with bounding box annotations on semantically meaningful components, and localize significant components (i.e., document object proposals) of a document. In our current implementation, we detect the following categories: text block, title, list, table, and figure.…”
Section: Pre-processing and Feature Extraction (mentioning)
Confidence: 99%
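The two excerpts above describe training a Faster R-CNN document object detector on PubLayNet and DocVQA layout annotations. A minimal sketch of that setup with torchvision follows; the function name, the five layout categories, and the use of a COCO-pretrained backbone are assumptions drawn from the quoted text, not the cited authors' implementation.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Layout categories named in the excerpt, plus the implicit background class.
LAYOUT_CLASSES = ["__background__", "text_block", "title", "list", "table", "figure"]

def build_document_detector(num_classes: int = len(LAYOUT_CLASSES)):
    # Start from a COCO-pretrained Faster R-CNN and replace its box predictor
    # with one sized for the document layout categories.
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model

if __name__ == "__main__":
    model = build_document_detector()
    model.eval()
    # One dummy page image (3 x H x W, values in [0, 1]); real training would
    # instead feed rendered pages and boxes from PubLayNet / DocVQA.
    page = torch.rand(3, 800, 600)
    with torch.no_grad():
        proposals = model([page])[0]
    print(proposals["boxes"].shape, proposals["labels"].shape)
```

Fine-tuning would then follow the standard torchvision detection loop, with targets given as per-page boxes and labels for these categories.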
“…for document image classification, as well as the DocVQA dataset (Mathew et al., 2020) for visual question answering on document images. Experiment results show that the LayoutLMv2 model outperforms strong baselines including the vanilla LayoutLM and achieves new state-of-the-art results in these downstream VrDU tasks, which substantially benefits a great number of real-world document understanding tasks.…”
Section: Transformer Layers (mentioning)
Confidence: 99%
“…The evaluation metric is the overall classification accuracy. Text and layout information is extracted by Microsoft OCR. DocVQA: as a VQA dataset in the document understanding field, DocVQA (Mathew et al., 2020) consists of 50,000 questions defined on over 12,000 pages from a variety of documents. Pages are split into training, validation, and test sets with a ratio of about 8:1:1.…”
Citation type: mentioning
Confidence: 99%
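The excerpt above summarizes the DocVQA statistics: roughly 50,000 questions over 12,000 pages, split about 8:1:1 into train, validation, and test. A minimal sketch of producing such a page-level split is below; the function name and the fixed seed are illustrative assumptions, and the dataset itself ships with an official split that should be preferred.

```python
import random

def split_pages(page_ids, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle page ids and split them into train/val/test by the given ratios."""
    assert abs(sum(ratios) - 1.0) < 1e-9
    ids = list(page_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(ratios[0] * len(ids))
    n_val = int(ratios[1] * len(ids))
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }

if __name__ == "__main__":
    splits = split_pages(range(12_000))  # placeholder ids standing in for ~12K pages
    print({name: len(ids) for name, ids in splits.items()})
```

Splitting at the page level (rather than the question level) keeps all questions about a given page in the same split, which avoids leaking page content between train and test.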
“…Similarly, ST-VQA [4] contains 32k questions on images from 6 different sources (IC13 [26], IC15 [25], ImageNet [10], VizWiz [3], IIIT Scene Text Retrieval, Visual Genome [29], and COCO-Text [59]). A series of datasets were introduced following these which focused on specific aspects of text-based VQA, including OCR-VQA [41], STE-VQA [62], DocVQA [38], PlotQA [39], and LEAF-QA [7].…”
Section: Downstream OCR Applications (mentioning)
Confidence: 99%