Adaptive Communication: Languages with More Non-Native Speakers Tend to Have Fewer Word Forms

Bentz, Christian; Verkerk, Annemarie; Kiela, Douwe; Hill, Felix; Buttery, Paula

doi:10.1371/journal.pone.0128254

Cited by 68 publications

(52 citation statements)

References 48 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…For example, several recent studies engage in establishing information-theoretic and corpus-based methods for linguistic typology, i.e. classifying and comparing languages according to their information encoding potential [10,[12][13][14][15][16], and how this potential evolves over time [17][18][19]. Similar methods have been applied to compare and distinguish non-linguistic sequences from written 2 of 34 language [20,21], though it is controversial whether this helps with more fine-grained distinctions between symbolic systems and written language [22,23].…”

Section: Introductionmentioning

confidence: 99%

The Entropy of Words—Learnability and Expressivity across More Than 1000 Languages

Bentz¹,

Alikaniotis²,

Cysouw³

et al. 2017

Preprint

Self Cite

View full text Add to dashboard Cite

Abstract:The choice associated with words is a fundamental property of natural languages. It lies at the heart of quantitative linguistics, computational linguistics, and language sciences more generally. Information-theory gives us tools at hand to measure precisely the average amount of choice associated with words -the word entropy. Here we use three parallel corpora -encompassing ca. 450 million words in 1916 texts and 1259 languages -to tackle some of the major conceptual and practical problems of word entropy estimation: dependence on text size, register, style and estimation method, as well as non-independence of words in co-text. We present three main results: 1) a text size of 50K tokens is sufficient for word entropies to stabilize throughout the text, 2) across languages of the world, word entropies display a unimodal distribution that is skewed to the right. This suggests that there is a trade-off between the learnability and expressivity of words across languages of the world. 3) There is a strong linear relationship between unigram entropies and entropy rates, suggesting that they are inherently linked. We discuss the implications of these results for studying the diversity and evolution of languages from an information-theoretic point of view.

show abstract

Section: Introductionmentioning

confidence: 99%

The Entropy of Words—Learnability and Expressivity across More Than 1000 Languages

Bentz¹,

Alikaniotis²,

Cysouw³

et al. 2017

Preprint

Self Cite

View full text Add to dashboard Cite

show abstract

“…Rank frequency research has expanded beyond a narrow focus on adult, monolingual, native speakers to demonstrate distinct rank frequency distributions for corpora of varying levels of L2 proficiency across users of natural language [30,31] and artificial command languages [32], L1 attritors who have lost proficiency in their L1 over their lifespan [31], different language combinations of spontaneous codeswitching [33], and in languages with varying proportions of non-native speakers [34].…”

Section: Introductionmentioning

confidence: 99%

“…Distinctive patterns have also been found in child vs. adult caregiver speech [36]: children have a lower α exponent compared to fully proficient adults. Collectively, rank frequency analysis of diachronic and comparative proficiency language data suggest that the exponent serves as an indicator of linguistic complexity [31,[34][35][36][37]. Comparisons of the exponent across multiple languages which vary in their degree of synthetic to analytic structural complexity also support the relationship of the exponent to typological differences and/or to processes of typological change [33,[37][38][39].…”

Section: Introductionmentioning

confidence: 99%

“…WE data offers an ideal testing ground for this. Specifically, because ZL and ML are quantitative measures of lexical diversity, with an inverse relationship, lower diversity should manifest through higher constants (C, α, β) [34]. Comparing parallel synchronic corpora, languages with a larger proportion of L2 speakers have higher constants, and less lexical diversity: adult L2 learners provoke a reduction in the number of word forms in the code diachronically [34].…”

Section: Introductionmentioning

confidence: 99%

“…Specifically, because ZL and ML are quantitative measures of lexical diversity, with an inverse relationship, lower diversity should manifest through higher constants (C, α, β) [34]. Comparing parallel synchronic corpora, languages with a larger proportion of L2 speakers have higher constants, and less lexical diversity: adult L2 learners provoke a reduction in the number of word forms in the code diachronically [34]. In tandem, languages with more synthetic properties (more inflected forms of the same dictionary entries and less reliance on discrete function words to show grammatical properties) display lower constants, and hence greater lexical diversity [18].…”

Section: Introductionmentioning

confidence: 99%

See 2 more Smart Citations

Indian English Evolution and Focusing Visible Through Power Laws

et al. 2017

View full text Add to dashboard Cite

New dialect emergence and focusing in language contact settings is difficult to capture and date in terms of global structural dialect stabilization. This paper explores whether diachronic power law frequency distributions can provide evidence of dialect evolution and new dialect focusing, by considering the quantitative frequency characteristics of three diachronic Indian English (IE) corpora (1970s-2008). The results demonstrate that IE consistently follows power law frequency distributions and the corpora are each best fit by Mandelbrot's Law. Diachronic changes in the constants are interpreted as evidence of lexical and syntactic collocational focusing within the process of new dialect formation. Evidence of new dialect focusing is also visible through apparent time comparison of spoken and written data. Age and gender-separated sub-corpora of the most recent corpus show minimal deviation, providing apparent time evidence for emerging IE dialect stability. From these findings, we extend the interpretation of diachronic changes in the β coefficient-as indicative of changes in the degree of synthetic/analytic structure-so that β is also sensitive to grammaticalization and changes in collocational patterns.

show abstract

What makes a language easy to learn? A preregistered study on how systematic structure and community size affect language learnability

Raviv

Kloots²,

Meyer

2021

Cognition

View full text Add to dashboard Cite

Adaptive Communication: Languages with More Non-Native Speakers Tend to Have Fewer Word Forms

Cited by 68 publications

References 48 publications

The Entropy of Words—Learnability and Expressivity across More Than 1000 Languages

The Entropy of Words—Learnability and Expressivity across More Than 1000 Languages

Indian English Evolution and Focusing Visible Through Power Laws

What makes a language easy to learn? A preregistered study on how systematic structure and community size affect language learnability

Contact Info

Product

Resources

About

Adaptive Communication: Languages with More Non-Native Speakers Tend to Have Fewer Word Forms

Cited by 68 publications

References 48 publications

The Entropy of Words&mdash;Learnability and Expressivity across More Than 1000 Languages

The Entropy of Words&mdash;Learnability and Expressivity across More Than 1000 Languages

Indian English Evolution and Focusing Visible Through Power Laws

What makes a language easy to learn? A preregistered study on how systematic structure and community size affect language learnability

Contact Info

Product

Resources

About

The Entropy of Words—Learnability and Expressivity across More Than 1000 Languages

The Entropy of Words—Learnability and Expressivity across More Than 1000 Languages