Testing the Relationship between Word Length, Frequency, and Predictability Based on the German Reference Corpus

Koplenig, Alexander; Kupietz, Marc; Wolfer, Sascha

doi:10.1111/cogs.13090

Cited by 8 publications

(8 citation statements)

References 13 publications


“… Note. Column 1: Original pair number, column 2: Polarity of the word, columns 3 and 4: Original English stimuli and their associated frequencies in the Corpus of Contemporary American English (COCA, Davies, 2010), columns 5 and 6: German translations for the original stimuli and their associated lemma frequencies in frequency dataset DeReKoGram (Koplenig, Kupietz, & Wolfer, 2022; Wolfer, Koplenig, Kupietz, & Müller‐Spitzer, 2023) which is based on the German Reference Corpus (Kupietz, Belica, Keibel, & Witt, 2010). Frequencies for translations marked with an asterisk are taken from the bigram frequency dataset.…”

Section: Methods (mentioning)

confidence: 99%

“…We use DeReKoGram (Koplenig et al., 2022; Wolfer et al., 2023), a uni‐, bi‐, and trigram frequency dataset based on approx. 43 billion tokens from the German Reference Corpus (Kupietz et al., 2010) to extract frequencies (Section 3.1), trigram frequencies of binomial expressions (Section 3.2), as well as the verbs used as a comparison set for the distributional semantics analysis (Section 3.4).…”

Section: Methods (mentioning)

confidence: 99%


Is More Always Better? Testing the Addition Bias for German Language Statistics

Wolfer

2023

Cognitive Science

Self Cite


This replication study aims to investigate a potential bias toward addition in the German language, building upon previous findings of Winter and colleagues who identified a similar bias in English. Our results confirm a bias in word frequencies and binomial expressions, aligning with these previous findings. However, the analysis of distributional semantics based on word vectors did not yield consistent results for German. Furthermore, our study emphasizes the crucial role of selecting appropriate translational equivalents, highlighting the significance of considering language‐specific factors when testing for such biases for languages other than English.


“…We included begin- and end-of-document markers «START» and «END». Please refer to Koplenig et al. [11] for an explanation. Note that n-grams crossing sentence boundaries can be excluded by deleting n-grams based on the POS tag $., which identifies the end of a sentence.…”

Section: Data Selection (mentioning)

confidence: 99%
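
The filtering step described in the statement above can be sketched as follows. This is a minimal illustration, not the dataset authors' code: the in-memory layout (tuples of token and POS tag plus a count), the function name `crosses_sentence_boundary`, and the toy counts are all assumptions, and the real DeReKoGram file format may differ. It also adopts one possible reading of the statement, namely that an n-gram crosses a sentence boundary when the tag `$.` occurs anywhere before its last position.

```python
# Hypothetical layout: each n-gram is a tuple of (token, POS tag) pairs,
# stored together with a count. The real DeReKoGram files may look different.
SENTENCE_END_TAG = "$."  # POS tag that marks the end of a sentence


def crosses_sentence_boundary(ngram):
    """Return True if the sentence-end tag occurs before the last position."""
    return any(pos == SENTENCE_END_TAG for _, pos in ngram[:-1])


# Toy bigrams with invented counts, for illustration only.
bigrams = [
    ((("Der", "ART"), ("Hund", "NN")), 12),
    ((("bellt", "VVFIN"), (".", "$.")), 7),   # ends at the boundary: kept
    (((".", "$."), ("Die", "ART")), 5),       # spans the boundary: dropped
]

kept = [(ngram, count) for ngram, count in bigrams
        if not crosses_sentence_boundary(ngram)]
```

A stricter variant would drop every n-gram that contains `$.` at any position; which reading is appropriate depends on whether sentence-final punctuation should remain in the counts.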

“…For Python, we also show how to train smoothed n-gram language models with DeReKoGram. For Stata and R code for another linguistic application please see the supplementary material of a previous study [11].…”

Section: Introduction (mentioning)

confidence: 99%
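
To make the idea of a smoothed n-gram language model concrete, here is a minimal sketch built from bigram counts. The statement above does not say which smoothing method the cited scripts use, so add-one (Laplace) smoothing is used here purely as a stand-in. The function name `train_add_one_bigram` and the toy counts are assumptions for illustration.

```python
from collections import Counter


def train_add_one_bigram(bigram_counts):
    """Return a function prob(w1, w2) estimating p(w2 | w1) with add-one smoothing."""
    context_counts = Counter()
    vocab = set()
    for (w1, w2), count in bigram_counts.items():
        context_counts[w1] += count
        vocab.update((w1, w2))
    vocab_size = len(vocab)

    def prob(w1, w2):
        # Add 1 to every bigram count, and the vocabulary size to the context total.
        return (bigram_counts.get((w1, w2), 0) + 1) / (context_counts[w1] + vocab_size)

    return prob, vocab


# Toy bigram counts, invented for illustration only.
counts = {("der", "Hund"): 3, ("der", "Mann"): 1, ("Hund", "bellt"): 2}
prob, vocab = train_add_one_bigram(counts)

p_seen = prob("der", "Hund")     # (3 + 1) / (4 + 4)
p_unseen = prob("der", "bellt")  # (0 + 1) / (4 + 4)
```

Unseen bigrams receive a small non-zero probability, and the probabilities for a given context still sum to one over the vocabulary.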

Introducing DeReKoGram: A Novel Frequency Dataset with Lemma and Part-of-Speech Information for German

Wolfer, Koplenig, Kupietz, et al.

2023

Data

Self Cite


We introduce DeReKoGram, a novel frequency dataset containing lemma and part-of-speech (POS) information for 1-, 2-, and 3-grams from the German Reference Corpus. The dataset contains information based on a corpus of 43.2 billion tokens and is divided into 16 parts based on 16 corpus folds. We describe how the dataset was created and structured. By evaluating the distribution over the 16 folds, we show that it is possible to work with a subset of the folds in many use cases (e.g., to save computational resources). In a case study, we investigate the growth of vocabulary (as well as the number of hapax legomena) as an increasing number of folds are included in the analysis. We cross-combine this with the various cleaning stages of the dataset. We also give some guidance in the form of Python, R, and Stata markdown scripts on how to work with the resource.


“…Over the years, Zipf's law of abbreviation has been empirically investigated numerous times (Wimmer et al., 1994; Sigurd et al., 2004; Kanwal et al., 2017; Koplenig et al., 2022; Levshina, 2022; Petrini et al., 2022, 2023). We now present a formal derivation of Zipf's law of abbreviation by viewing it as an instantiation of the lexicalization problem.…”

Section: Introduction (mentioning)

confidence: 99%

Revisiting the Optimality of Word Lengths

Pimentel, Meister, Wilcox, et al.

2023

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing


Zipf (1935) posited that wordforms are optimized to minimize utterances' communicative costs. Under the assumption that cost is given by an utterance's length, he supported this claim by showing that words' lengths are inversely correlated with their frequencies. Communicative cost, however, can be operationalized in different ways. Piantadosi et al. (2011) claim that cost should be measured as the distance between an utterance's information rate and channel capacity, which we dub the channel capacity hypothesis (CCH) here. Following this logic, they then proposed that a word's length should be proportional to the expected value of its surprisal (negative log-probability in context). In this work, we show that Piantadosi et al.'s derivation does not minimize CCH's cost, but rather a lower bound, which we term CCH↓. We propose a novel derivation, suggesting an improved way to minimize CCH's cost. Under this method, we find that a language's word lengths should instead be proportional to the surprisal's expectation plus its variance-to-mean ratio. Experimentally, we compare these three communicative cost functions: Zipf's, CCH↓, and CCH. Across 13 languages and several experimental settings, we find that length is better predicted by frequency than either of the other hypotheses. In fact, when surprisal's expectation, or expectation plus variance-to-mean ratio, is estimated using better language models, it leads to worse word length predictions. We take these results as evidence that Zipf's longstanding hypothesis holds. https://github.com/tpimentelms/optimality-of-word-lengths
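
The three word-length predictors named in this abstract can be illustrated with a short sketch. It is not the authors' implementation: the function name `length_predictors`, the use of unigram surprisal as the frequency-based (Zipf) predictor, the choice of bits as the unit, and the equal weighting of the sampled contexts are all assumptions made for illustration. Each value is meant only up to a constant of proportionality.

```python
import math


def surprisal(p):
    """Negative log-probability in bits."""
    return -math.log2(p)


def length_predictors(unigram_prob, in_context_probs):
    """Compute three illustrative length predictors for one word.

    unigram_prob: the word's context-free probability (assumed Zipf predictor).
    in_context_probs: the word's probability in each sampled context,
        with every context weighted equally (an assumption of this sketch).
    """
    s = [surprisal(p) for p in in_context_probs]
    mean = sum(s) / len(s)
    var = sum((x - mean) ** 2 for x in s) / len(s)  # population variance
    return {
        "zipf": surprisal(unigram_prob),
        "cch_lower": mean,           # expected surprisal (CCH lower bound)
        "cch": mean + var / mean,    # expectation plus variance-to-mean ratio
    }


# Invented probabilities for a single word, for illustration only.
predictors = length_predictors(unigram_prob=0.25, in_context_probs=[0.5, 0.125])
```

In this toy case the in-context surprisals are 1 and 3 bits, so the expectation is 2 and the variance is 1, and the CCH predictor is 2 + 1/2 = 2.5. The sketch divides by the mean surprisal, so it is undefined for a word that is fully predictable in every context.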
