Neues vom heutigen Deutsch 2019
DOI: 10.1515/9783110622591-021
Neues von KorAP

Abstract: Since May 2017, KorAP has been running at the IDS as a beta version in parallel operation with COSMAS II, providing comprehensive access to a part of DeReKo (Kupietz et al. 2010). In the meantime, many new search functions have been added so that users can test and evaluate them. While KorAP is still far from the stability with which COSMAS II can offer its services, the beta-version status allows a high degree of flexibil…

Cited by 1 publication (1 citation statement, published 2022); references 0 publications.
“…DeReKo currently contains more than 50 billion [henceforth, b] tokens and comprises a multitude of genres, such as (a large number of) newspaper texts, fiction, or specialized texts, with a current growth rate of ∼3b words per year (Kupietz et al, 2018). Tokenization was carried out using the KorAP tokenizer (Kupietz et al, 2021), the deterministic finite automaton scanning rules of which are based on those of the Apache Lucene tokenizer. Part‐of‐speech tagging and lemmatization is based on TreeTagger (Schmid, 1994).…”
Section: Data and Preprocessing
Confidence: 99%
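The quoted passage describes tokenization via deterministic finite-automaton scanning rules. As a rough illustration of what rule-based scanning tokenization looks like, here is a minimal Python sketch; it is not the actual KorAP or Lucene tokenizer, and the token rules below are hypothetical simplifications:

```python
import re

# Illustrative only: a tiny rule-based tokenizer in the spirit of
# deterministic left-to-right scanning. The rules are simplified
# stand-ins, not the KorAP tokenizer's actual rule set.
TOKEN_RE = re.compile(
    r"""
    \d+(?:[.,]\d+)*      # numbers, incl. German decimal/thousands marks
    | \w+(?:-\w+)*       # words, incl. hyphenated compounds
    | [^\w\s]            # any single punctuation character
    """,
    re.VERBOSE | re.UNICODE,
)

def tokenize(text: str) -> list[str]:
    """Scan the text left to right, emitting non-overlapping tokens."""
    return TOKEN_RE.findall(text)

print(tokenize("DeReKo umfasst über 50 Mrd. Tokens."))
# → ['DeReKo', 'umfasst', 'über', '50', 'Mrd', '.', 'Tokens', '.']
```

A real scanner-based tokenizer compiles such rules into a single finite automaton so that each character is examined only once, which is what makes the approach fast enough for corpora in the multi-billion-token range.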