2021
DOI: 10.48550/arxiv.2110.00061
Preprint

PubTables-1M: Towards comprehensive table extraction from unstructured documents

Cited by 3 publications
(4 citation statements)
“…For the experiments we use tables from the test set of the PubTables-1M [8] dataset, which provides text content and location information for every cell, including blank cells. To make sure each remaining grid cell has well-defined content and location after removing a subset of rows and columns from a table, we only use tables that do not contain any spanning cells.…”
Section: Methods (mentioning)
confidence: 99%
“…Measures of partial correctness are useful not only because they are more granular, but also because they are less impacted by small errors and ambiguities in the ground truth. This is important, as creating unambiguous ground truth for table structure recognition is a challenging problem, which introduces noise not only into the learning task but also performance evaluation [8,9]. Designing a metric for partial table correctness has also proven challenging.…”
Section: Introduction (mentioning)
confidence: 99%
“…A few samples from this dataset are shown in Figure 14. PubTables-1M [63] contains nearly one million tables from scientific articles, supports multiple input modalities and contains detailed header and location information for table structures, making it useful for a wide variety of modeling approaches. It also addresses a significant source of ground truth inconsistency observed in prior datasets called over-segmentation, using a novel canonicalization procedure.…”
Section: Uw-3 Table (mentioning)
confidence: 99%
“…Recently transformer-based models were applied to document layout analysis, Smock, Brandon et al [63] applied Carion et al [93] DEtection TRansformer framework, a transformer encoder-decoder architecture, to their table dataset for both table detection and structure recognition tasks. Xu et al [94] proposed a self-supervised pre-trained Document Image Transformer model using large-scale unlabeled text images for document analysis, including table detection…”
Section: The System Searches For Sequences Of Table-like Lines Based ... (mentioning)
confidence: 99%