2020
DOI: 10.48550/arxiv.2003.02356
Preprint

Kleister: A novel task for Information Extraction involving Long Documents with Complex Layout

Filip Graliński, Tomasz Stanisławek, Anna Wróblewska, et al.

Abstract: State-of-the-art solutions for Natural Language Processing (NLP) are able to capture a broad range of contexts, like the sentence-level context or document-level context for short documents. But these solutions are still struggling when it comes to longer, real-world documents with the information encoded in the spatial structure of the document, such as page elements like tables, forms, headers, openings or footers; complex page layout or presence of multiple pages. To encourage progress on deeper and more com…

Cited by 7 publications (10 citation statements)
References 8 publications
“…DocFormer achieves 96.33% F1 on this dataset besting all prior *-base and virtually all *large variants tying with TILT-large [40] which has higher number of parameters. Kleister-NDA [16]: dataset consists of legal NDA documents. The task with Kleister-NDA data is to extract the values of four fixed labels.…”
Section: Entity Extraction Task
Type: mentioning
confidence: 99%
“…We present all the hyper-parameters in Table 11 used for pre-training and fine-tuning DocFormer. We fine-tune on downstream tasks on the same number of epochs as prior art [55,56,25]: FUNSD [17], Kleister-NDA [16] datasets were fine-tuned for 100 epochs. CORD [46] for 200 epochs.…”
Section: Implementation Details
Type: mentioning
confidence: 99%
“…The standard approach to information extraction is a two stage process, that requires an initial OCR step followed by a second information localization step. Localization can be performed on the extracted text sequence, by training a NER model [5]. With such an approach however spatial information is lost.…”
Section: Related Work
Type: mentioning
confidence: 99%
“…Visually rich document understanding includes many tasks, such as layout recognization (Zhong et al, 2019b; Li et al, 2020), table detection and recognition (Li et al, 2019a; Zhong et al, 2019a) and key information extraction (Graliński et al, 2020; Guo et al, 2019; Huang et al, 2019; G. Jaume and Thiran, 2019; Majumder et al, 2020).…”
Section: Related Work
Type: mentioning
confidence: 99%