2019 15th International Conference on eScience (eScience) 2019
DOI: 10.1109/escience.2019.00033

defoe: A Spark-Based Toolbox for Analysing Digital Historical Textual Data

Abstract: This work presents defoe, a new scalable and portable digital eScience toolbox that enables historical research. It allows for running text mining queries across large datasets, such as historical newspapers and books, in parallel via Apache Spark. It handles queries against collections that comprise several XML schemas and physical representations. The proposed tool has been successfully evaluated using five different large-scale historical text datasets and two HPC environments, as well as on desktops. Result…

Cited by 7 publications
(9 citation statements)
References 7 publications
“…See Figure 5. The new NLS object model is added to our previous collection of object models [4]. We have used it to analyse the Encyclopaedia Britannica, the Gazetteers of Scotland, and the Chapbooks printed in Scotland, as described in Section VI.…”
Section: A. Ingestion of NLS Digital Collections
Citation type: mentioning
confidence: 99%
“…This section gives an overview of our previous work on defoe [4], providing the necessary background for the functionalities presented in Section IV.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…The results indicated a significant improvement in the accuracy of Naive Bayes and Logistic Regression with respect to Decision Trees. The research presented in [23] introduced a new portable and scalable digital eScience toolbox called "defoe" that can extract knowledge from historical data by running text-mining queries across large historical datasets in parallel using Apache Spark. The experiments were done using five different historical datasets and two different high-performance computing (HPC) environments as well as on desktop [23].…”
Section: Research on Apache Spark
Citation type: mentioning
confidence: 99%
“…The research presented in [23] introduced a new portable and scalable digital eScience toolbox called "defoe" that can extract knowledge from historical data by running text-mining queries across large historical datasets in parallel using Apache Spark. The experiments were done using five different historical datasets and two different high-performance computing (HPC) environments as well as on desktop [23]. The authors showed that "defoe" can allow researchers to query multiple datasets in parallel from a single command-line interface [23].…”
Section: Research on Apache Spark
Citation type: mentioning
confidence: 99%