2017
DOI: 10.5334/jors.164

The Giles Ecosystem – Storage, Text Extraction, and OCR of Documents

Abstract: In the digital humanities, there is a constant need to turn images and PDF files into plain text to apply analyses such as topic modelling, named entity recognition, and other techniques. However, although there exist different solutions to extract text embedded in PDF files or run OCR on images, they typically require additional training (for example, scholars have to learn how to use the command line) or are difficult to automate without programming skills. The Giles Ecosystem is a distributed system based o…

Citations: cited by 6 publications (2 citation statements)
References: 2 publications (2 reference statements)
“…(Trevathan et al 1999; Gluckman et al 2009) For each individual we gather their academic publication from Web of Science (Reuters 2012) and transform these into plain text files and then into a comprehensive corpus using the Giles framework. (Damerow et al 2017) Our full corpus contains 6,456 full-text publications from 1971 through 2017. The corpus is then hand curated to identify errors, such as wrongly assigning work to individuals.…”
Section: Methods (mentioning)
confidence: 99%
“…Applying CTA to archaeological texts is much less common, but it has recently been used to analyze archaeological publications (e.g., Park et al 2020; Schmidt and Marwick 2020). Finally, OCR is often used as a digitizing tool by archaeologists (e.g., Heath et al 2019; McManamon et al 2017) and is also common in the digital humanities, along with text extraction (Damerow et al 2017; Pintus et al 2015). These spatial and textual analysis methods have been combined in archaeological research through what Murrieta-Flores and Gregory (2015) refer to as Geographic Text Analysis (GTA).…”
Section: Introduction (mentioning)
confidence: 99%