2022
DOI: 10.1111/cogs.13090

Testing the Relationship between Word Length, Frequency, and Predictability Based on the German Reference Corpus

Abstract: In a recent article, Meylan and Griffiths (2021, henceforth M&G) focus on the significant methodological challenges that can arise when using large-scale linguistic corpora. To this end, M&G revisit a well-known result of Piantadosi, Tily, and Gibson (2011, henceforth PT&G), who argue that average information content is a better predictor of word length than word frequency. We applaud M&G for conducting a very important study that should be read by any researcher interested…

Cited by 8 publications (8 citation statements)
References 13 publications
“… Note. Column 1: Original pair number, column 2: Polarity of the word, columns 3 and 4: Original English stimuli and their associated frequencies in the Corpus of Contemporary American English (COCA, Davies, 2010), columns 5 and 6: German translations for the original stimuli and their associated lemma frequencies in frequency dataset DeReKoGram (Koplenig, Kupietz, & Wolfer, 2022; Wolfer, Koplenig, Kupietz, & Müller‐Spitzer, 2023) which is based on the German Reference Corpus (Kupietz, Belica, Keibel, & Witt, 2010). Frequencies for translations marked with an asterisk are taken from the bigram frequency dataset.…”
Section: Methods (mentioning)
confidence: 99%
“…We use DeReKoGram (Koplenig et al., 2022; Wolfer et al., 2023), a uni‐, bi‐, and trigram frequency dataset based on approx. 43 billion tokens from the German Reference Corpus (Kupietz et al., 2010) to extract frequencies (Section 3.1), trigram frequencies of binomial expressions (Section 3.2), as well as the verbs used as a comparison set for the distributional semantics analysis (Section 3.4).…”
Section: Methods (mentioning)
confidence: 99%
“…We included begin- and end-of-document markers «START» and «END». Please refer to Koplenig et al. [11] for an explanation. Note that n-grams crossing sentence boundaries can be excluded by deleting n-grams based on the POS tag $., which identifies the end of a sentence.…”
Section: Data Selection (mentioning)
confidence: 99%
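
The filtering step described in the statement above can be sketched roughly as follows. This is only an illustration under stated assumptions: the in-memory representation (each n-gram as a tuple of (token, POS tag) pairs plus a count), the function name `drop_sentence_crossing`, and the `keep_final` switch are all hypothetical and are not taken from DeReKoGram or its documentation. Only the tag `$.` comes from the quoted text.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical representation (an assumption, not the dataset's file format):
# an n-gram is a tuple of (token, pos_tag) pairs, paired with its frequency.
Unit = Tuple[str, str]
Ngram = Tuple[Tuple[Unit, ...], int]

SENTENCE_END_TAG = "$."  # POS tag named in the quoted statement


def drop_sentence_crossing(
    ngrams: Iterable[Ngram], keep_final: bool = True
) -> Iterator[Ngram]:
    """Yield n-grams that do not cross a sentence boundary.

    With keep_final=True, an n-gram is dropped only if the sentence-final
    tag occurs before its last position (material follows the boundary).
    With keep_final=False, any n-gram containing the tag is dropped.
    """
    for units, count in ngrams:
        tags = [tag for _, tag in units]
        # Positions to inspect: all but the last one when keep_final is set.
        inspected = tags[:-1] if keep_final else tags
        if SENTENCE_END_TAG in inspected:
            continue
        yield units, count


sample = [
    ((("Haus", "NN"), (".", "$."), ("Der", "ART")), 5),    # crosses a boundary
    ((("das", "ART"), ("Haus", "NN"), (".", "$.")), 7),    # ends at a boundary
    ((("das", "ART"), ("alte", "ADJA"), ("Haus", "NN")), 3),
]
kept_counts = [count for _, count in drop_sentence_crossing(sample)]
strict_counts = [
    count for _, count in drop_sentence_crossing(sample, keep_final=False)
]
```

Whether n-grams that merely end in sentence-final punctuation should also be removed is a design choice the quoted statement leaves open, which is why the sketch exposes both behaviours.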
“…For Python, we also show how to train smoothed n-gram language models with DeReKoGram. For Stata and R code for another linguistic application please see the supplementary material of a previous study [11].…”
Section: Introduction (mentioning)
confidence: 99%
“…Over the years, Zipf's law of abbreviation has been empirically investigated numerous times (Wimmer et al., 1994; Sigurd et al., 2004; Kanwal et al., 2017; Koplenig et al., 2022; Levshina, 2022; Petrini et al., 2022, 2023). We now present a formal derivation of Zipf's law of abbreviation by viewing it as an instantiation of the lexicalization problem.…”
Section: Introduction (mentioning)
confidence: 99%