2020
DOI: 10.3390/e22010126

A Standardized Project Gutenberg Corpus for Statistical Analysis of Natural Language and Quantitative Linguistics

Abstract: The use of Project Gutenberg (PG) as a text corpus has been extremely popular in statistical analysis of language for more than 25 years. However, in contrast to other major linguistic datasets of similar importance, no consensual full version of PG exists to date. In fact, most PG studies so far either consider only a small number of manually selected books, leading to potential biased subsets, or employ vastly different pre-processing strategies (often specified in insufficient details), raising concerns reg…

Citations: cited by 52 publications (56 citation statements)
References: 62 publications
“…We use the API functionality of Project Gutenberg [17] to obtain our document corpus and the natural-language-processing (NLP) library spaCy [20] to extract an ordered punctuation sequence from each document. Using data from Project Gutenberg requires various filtering and cleaning steps before it is possible to meaningfully perform statistical analysis [14]. We describe our steps below.…”
Section: Data Set (mentioning)
confidence: 99%
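
To make the idea of an "ordered punctuation sequence" in the citation statement above concrete, here is a minimal sketch. It is not the citing authors' pipeline, which relies on spaCy; it is a standard-library stand-in that treats any character whose Unicode category starts with "P" as punctuation. The function name `punctuation_sequence` is hypothetical.

```python
import unicodedata


def punctuation_sequence(text: str) -> list[str]:
    """Return the punctuation marks of `text` in order of appearance.

    Assumption: "punctuation" means any character in a Unicode
    punctuation category (Pc, Pd, Ps, Pe, Pi, Pf, Po). A tokenizer-based
    approach such as spaCy's may classify some characters differently.
    """
    return [ch for ch in text if unicodedata.category(ch).startswith("P")]


if __name__ == "__main__":
    sample = "Well, I never! Did you? (No.)"
    print(punctuation_sequence(sample))
```

This character-level rule also counts apostrophes and hyphens inside words, so a real analysis would likely need additional filtering or a tokenizer.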
“…In some of our computational experiments, we use the following metadata: author birth year, author death year, and document "bookshelf" (which we term document "genre", as that is what it appears to represent). The authors of [14] pointed out recently that "bookshelf" may be better suited than "subject" to practical purposes such as text classification, because the former constitute broader categories and provide a unique assignment of labels to documents.…”
Section: Data Set (mentioning)
confidence: 99%