Proceedings of the 28th International Conference on Computational Linguistics 2020
DOI: 10.18653/v1/2020.coling-main.82

DocBank: A Benchmark Dataset for Document Layout Analysis

Abstract: Document layout analysis usually relies on computer vision models to understand documents while ignoring textual information that is vital to capture. Meanwhile, high-quality labeled datasets with both visual and textual information are still insufficient. In this paper, we present DocBank, a benchmark dataset that contains 500K document pages with fine-grained token-level annotations for document layout analysis. DocBank is constructed using a simple yet effective way with weak supervision from the LaTeX d…

Cited by 144 publications (94 citation statements). References 38 publications.

“…Another dataset to solve document layout analysis is released by Li et al [13]. The dataset is known as DocBank, which is the extended version of the TableBank dataset [46].…”
Section: Docbank (citation type: mentioning)
confidence: 99%
“…In this survey paper, we have presented a thorough analysis of the recent state-of-the-art approaches that have approached the problem of graphical page object detection in scanned document images by employing deep neural networks. Since page objects can be of several types [13], we have covered the three most important page objects in document images [9]. These graphical page objects are referred to as table, formulas, and figures.…”
Section: Introduction (citation type: mentioning)
confidence: 99%
“…The study [11] developed token-level annotations on a scientific articles dataset named DocBank, used for the article metadata extraction and document layout analysis tasks. It consists of 500 thousand document pages with the label annotations such as Abstract, Author, Caption, Figure, and few others.…”
Section: A Standard Datasets (citation type: mentioning)
confidence: 99%