2023
DOI: 10.1007/978-3-031-28032-0_31

A Benchmark of PDF Information Extraction Tools Using a Multi-task and Multi-domain Evaluation Framework for Academic Documents

Abstract: Extracting information from academic PDF documents is crucial for numerous indexing, retrieval, and analysis use cases. Choosing the best tool to extract specific content elements is difficult because many, technically diverse tools are available, but recent performance benchmarks are rare. Moreover, such benchmarks typically cover only a few content elements like header metadata or bibliographic references and use smaller datasets from specific academic disciplines. We provide a large and diverse evaluation f…

Cited by 8 publications (4 citation statements)
References 46 publications
“…After an extensive comparison with open-source toolkits such as PDFBox and pdfminer.six, we use GROBID for the fulltext extraction from PDF files. Our study validates findings from Meuschke et al (2023) which found GROBID outperforms the other freely-available tools in metadata, reference, and general text extraction tasks from academic PDF documents. We take the S2ORC-JSON format used by and Wang et al (2020) for our full-text schema, which includes complete information parsed from PDF files, such as metadata, authors, and body text with citations, references, sections and etc.…”
Section: Full-text Extraction
Citation type: supporting
confidence: 87%
“…PyMuPDF allows access to information about the more underlying details of the PDF file. However, a benchmark demonstrates their imperfect performance (Meuschke et al, 2023).…”
Section: Pdf Information Extraction Softwares
Citation type: mentioning
confidence: 99%
“…and multimodal contents such as images, tables, equations etc. and their captions (Meuschke et al, 2023; Bast and Korzen, 2017). Additionally, these documents often contain elements that are not directly related to the core content, such as watermarks (Chia et al, 2018), publisher details and header information that serves navigation in collections.…”
Section: Introduction
Citation type: mentioning
confidence: 99%