2017
DOI: 10.3390/e19060275
The Entropy of Words—Learnability and Expressivity across More than 1000 Languages

Abstract: The choice associated with words is a fundamental property of natural languages. It lies at the heart of quantitative linguistics, computational linguistics and language sciences more generally. Information theory gives us tools at hand to measure precisely the average amount of choice associated with words: the word entropy. Here, we use three parallel corpora, encompassing ca. 450 million words in 1916 texts and 1259 languages, to tackle some of the major conceptual and practical problems of word en…
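The abstract's central quantity, word entropy, is the Shannon entropy of the distribution over word types. As a minimal sketch (not the paper's own method, which evaluates several bias-corrected estimators), the naive plug-in estimate from a tokenized text looks like this, assuming simple whitespace tokenization:

```python
import math
from collections import Counter

def word_entropy(text: str) -> float:
    """Plug-in (maximum likelihood) estimate of word entropy in bits.

    A minimal sketch: whitespace tokenization and lowercasing are
    simplifying assumptions, and the plug-in estimate is biased low
    on small samples, one of the practical problems the paper tackles.
    """
    tokens = text.lower().split()
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(word_entropy("the cat sat on the mat the cat"))
```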

Cited by 87 publications (109 citation statements) · References 61 publications
“…For technical reasons, we are limited to calculating mutual information based on the joint frequencies of part-of-speech pairs, rather than wordforms. The reason we use part-of-speech tags is that getting a reliable estimate of mutual information from observed frequencies of wordforms is statistically difficult, requiring very large samples to overcome bias (Archer, Park, & Pillow, 2013; Basharin, 1959; Bentz, Alikaniotis, Cysouw, & Ferrer-i-Cancho, 2017; Futrell et al., 2019; Miller, 1955; Paninski, 2003). The mutual information estimation problem is less severe, however, when we are looking at joint counts over coarser-grained categories, such that there is not a long tail of one-off forms.…”
Section: Word Order Preferences
confidence: 99%
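The bias this passage describes can be made concrete with the naive plug-in estimator of mutual information, which over-estimates MI when many observed pairs occur only once. A minimal sketch (the function name and toy data are illustrative, not taken from the cited works):

```python
import math
from collections import Counter

def plugin_mi(pairs):
    """Plug-in mutual information (bits) from observed co-occurrence pairs.

    A minimal sketch of the estimator the quoted passage refers to: with
    long-tailed wordform distributions this estimate is biased upward on
    small samples, which is why coarse part-of-speech tags are safer.
    """
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )

# Toy example: adjacent part-of-speech tag pairs
tags = [("DET", "NOUN"), ("DET", "NOUN"), ("NOUN", "VERB"),
        ("ADJ", "NOUN"), ("DET", "ADJ"), ("NOUN", "VERB")]
print(plugin_mi(tags))
```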
“…Supporting this idea, studies on the developing Nicaraguan sign language have shown that complex linguistic structure emerges over multiple cohorts of learners (Senghas, Kita, & Ozyurek, 2004), and work on pidgins has suggested that new child learners are required in order to develop recursion (Bickerton, 1984). Second, it affects the reasoning and predictions made about the structure of human lexicons over time: from understanding trends in metaphorical mappings (Xu, Malt, & Srinivasan, 2017) to measuring the entropy and informativity of words (Bentz, Alikaniotis, Cysouw, & Ferrer-i-Cancho, 2017). Going beyond language evolution and change, this conclusion has already influenced work on a wide range of human behaviors.…”
Section: Introduction
confidence: 99%
“…So, first of all, it is necessary to show that the text corresponds to this style. To do this, the unconditional semantic entropy (5) must be determined to detect markers of journalistic style, using the technique of [16]; this will also be needed when determining the n-grams [17] used to form a model of the propagandist's psycholinguistic portrait.…”
Section: Stage 2. A Typical Psycholinguistic Profile Constructing
confidence: 99%
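The "unconditional semantic entropy" and n-gram model referenced here are defined in the citing paper (its equation (5) and references [16, 17], which are not reproduced in this excerpt). As a hedged sketch of the standard unconditional n-gram entropy that such techniques are assumed to build on:

```python
import math
from collections import Counter

def ngram_entropy(tokens, n=2):
    """Plug-in entropy (bits) of the n-gram distribution of a token sequence.

    A minimal sketch under the standard definition; the citing paper's
    'unconditional semantic entropy' (its equation (5)) is assumed to be
    a variant of this unconditional estimate over text units.
    """
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    total = len(grams)
    counts = Counter(grams)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

tokens = "the quick brown fox jumps over the lazy dog".split()
print(ngram_entropy(tokens, n=2))
```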