2009
DOI: 10.1007/s10462-009-9135-4
Extending Zipf’s law to n-grams for large corpora

Abstract: Experiments show that for a large corpus, Zipf's law does not hold for all ranks of words: the frequencies fall below those predicted by Zipf's law for ranks greater than about 5,000 word types in the English language and about 30,000 word types in the inflected languages Irish and Latin. It also does not hold for syllables or words in the syllable-based languages, Chinese or Vietnamese. However, when single words are combined together with word n-grams in one list and put in rank order, the frequency of token…
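
The combined ranking described in the abstract (single words and word n-grams merged into one list, then sorted by frequency) can be sketched as follows. This is only a minimal illustration under stated assumptions: the function names `ngrams` and `combined_rank_frequency`, the `max_n` parameter, whitespace tokenization, and the alphabetical tie-breaking are all hypothetical choices made here, not the procedure used in the paper.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous word n-grams of length n as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def combined_rank_frequency(tokens, max_n=3):
    """Count 1-grams up to max_n-grams in one table and rank them together.

    Assumption: ties in frequency are broken alphabetically so that the
    ranking is deterministic; the paper may handle ties differently.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    # Sort by descending frequency, then alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, item, freq)
            for rank, (item, freq) in enumerate(ranked, start=1)]


# Toy input; a real experiment would need a large corpus.
tokens = "the cat sat on the mat and the cat ran".split()
table = combined_rank_frequency(tokens, max_n=2)
for rank, item, freq in table[:3]:
    print(rank, item, freq)
```

Plotting the logarithm of frequency against the logarithm of rank for such a table is the usual way to check how closely the combined list follows a straight line.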

Cited by 24 publications (12 citation statements)
References 24 publications
“…This is not the case for a specific written language corpus, but for about any written language corpus. Indeed, Zipf's law has been found across many languages and language groups [19,34,36] and has not only been found for word frequencies, but also for other linguistic phenomena, such as part-of-speech tags [36], n-grams [18] and number words [11]. What's more, Zipf's law has even been found for non-human communication, such as for dolphin whistles [32] and gesturing in gorillas [16].…”
Section: Introduction
Type: mentioning
confidence: 99%
“…This phenomenon is also present when investigating lemmas [6]. Similarly, n-gram phrases fit this distribution [7,8]. Therefore, to some extent, repetitions of identical n-grams are likely to be found in large corpora.…”
Section: Reuse Of Partial Results
Type: mentioning
confidence: 89%
“…Baayen (2008, p. 226) elaborates the problem of sample independence of Zipf's law. In fact, Ha, Hanna, Ming, and Smith (2009) propose an extension of Zipf's law for large corpora. Egghe (2000) shows that the rank-frequency distribution follows Zipf's law with an additional exponent.…”
Section: Examining the N-gram Distributions
Type: mentioning
confidence: 99%