2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL) 2017
DOI: 10.1109/jcdl.2017.7991565

A Text Extraction Software Benchmark Based on a Synthesized Dataset

Abstract: Text extraction plays an important function for data processing workflows in digital libraries. For example, it is a crucial prerequisite for evaluating the quality of migrated textual documents. Complex file formats make the extraction process error-prone and have made it very challenging to verify the correctness of extraction components. Based on digital preservation and information retrieval scenarios, three quality requirements in terms of effectiveness of text extraction tools are identified: 1) is a certain …

Cited by 3 publications (1 citation statement)
References 21 publications
“…Of these three tools, Grobid performed better, which is also evident from our experimental results. The study by Duretec et al [28] presented the evaluation of Tika, DocToText, and Xpdf tools. Among these tools, Tika achieved 58% accuracy in extracting text from PDF documents, in orderly extraction, which is close to our experimental result.…”
Section: Background Study
Citation type: mentioning
confidence: 99%