In this paper, we provide quantitative evidence showing that languages spoken by many second language speakers tend to have relatively small nominal case systems or no nominal case at all. In our sample, all languages with more than 50% second language speakers had no nominal case. The negative association between the number of second language speakers and nominal case complexity generalizes to different language areas and families. As there are many studies attesting to the difficulty of acquiring morphological case in second language acquisition, this result supports the idea that languages adapt to the cognitive constraints of their speakers, as well as to the sociolinguistic niches of their speaking communities. We discuss our results with respect to sociolinguistic typology and the Linguistic Niche Hypothesis, as well as with respect to qualitative data from historical linguistics. All in all, multiple lines of evidence converge on the idea that morphosyntactic complexity is reduced by a high degree of language contact involving adult learners.
Abstract: The choice associated with words is a fundamental property of natural languages. It lies at the heart of quantitative linguistics, computational linguistics and language sciences more generally. Information theory gives us tools to measure precisely the average amount of choice associated with words: the word entropy. Here, we use three parallel corpora, encompassing ca. 450 million words in 1916 texts and 1259 languages, to tackle some of the major conceptual and practical problems of word entropy estimation: dependence on text size, register, style and estimation method, as well as non-independence of words in co-text. We present two main findings: Firstly, word entropies display relatively narrow, unimodal distributions. There is no language in our sample with a unigram entropy of less than six bits/word. We argue that this is in line with information-theoretic models of communication. Languages are held in a narrow range by two fundamental pressures: word learnability and word expressivity, with a potential bias towards expressivity. Secondly, there is a strong linear relationship between unigram entropies and entropy rates. The entropy difference between words with and without co-textual information is narrowly distributed around ca. three bits/word. In other words, knowing the preceding text reduces the uncertainty of words by roughly the same amount across languages of the world.
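As a rough illustration of the unigram word entropy mentioned above, the following sketch computes the plug-in (maximum-likelihood) estimate in bits/word from a list of tokens. The function name `unigram_entropy` and the whitespace tokenization in the usage note are assumptions made for illustration; the abstract itself points out that results depend on text size and estimation method, so this naive estimator should not be read as the authors' procedure.

```python
import math
from collections import Counter


def unigram_entropy(tokens):
    """Plug-in estimate of unigram entropy in bits/word.

    Assumption: relative frequencies are used directly as probabilities,
    which is known to underestimate entropy for small texts.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    # H = -sum_i p_i * log2(p_i), with p_i = count_i / total
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

For example, `unigram_entropy("the cat saw the dog".split())` gives the entropy of a five-token toy text; real estimates would need far larger samples.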
The problem of compression in standard information theory consists of assigning codes as short as possible to numbers. Here we consider the problem of optimal coding, under an arbitrary coding scheme, and show that it predicts Zipf's law of abbreviation, namely a tendency in natural languages for more frequent words to be shorter. We apply this result to investigate optimal coding also under so-called non-singular coding, a scheme where unique segmentation is not warranted but codes stand for a distinct number. Optimal non-singular coding predicts that the length of a word should grow approximately as the logarithm of its frequency rank, which is again consistent with Zipf's law of abbreviation. Optimal non-singular coding in combination with the maximum entropy principle also predicts Zipf's rank-frequency distribution. Furthermore, our findings on optimal non-singular coding challenge common beliefs about random typing. It turns out that random typing is in fact an optimal coding process, in stark contrast with the common assumption that it is detached from cost-cutting considerations. Finally, we discuss the implications of optimal coding for the construction of a compact theory of Zipfian laws more generally as well as other linguistic laws.
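The logarithmic growth of length with frequency rank can be sketched as follows, assuming a finite alphabet, no empty string, and the simple strategy of giving the shortest still-unused strings to the most frequent items. The function name `nonsingular_length` and its parameters are hypothetical and chosen only for this illustration; the sketch is not taken from the paper.

```python
def nonsingular_length(rank, alphabet_size):
    """Length of the string given to the item of a given frequency rank
    (1 = most frequent) when distinct strings are handed out shortest-first.

    Assumption: there are alphabet_size ** k distinct strings of length k,
    and the empty string is not used.
    """
    if rank < 1 or alphabet_size < 2:
        raise ValueError("rank must be >= 1 and alphabet_size >= 2")
    length = 1
    capacity = alphabet_size  # number of strings of length <= 1
    while rank > capacity:
        length += 1
        capacity += alphabet_size ** length  # add strings of the new length
    return length
```

With a binary alphabet, ranks 1 to 2 receive length 1, ranks 3 to 6 length 2, and ranks 7 to 14 length 3, so the length increases by one each time the rank roughly doubles.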
There are more than 7,000 languages spoken in the world today 1. It has been argued that the natural and social environment of languages drives this diversity 2-13. However, a fundamental question is how strong are environmental pressures, and does neutral drift suffice as a mechanism to explain diversification? We estimate the phylogenetic signals of geographic dimensions, distance to water, climate and population size on more than 6,000 phylogenetic trees of 46 language families. Phylogenetic signals of environmental factors are generally stronger than expected under the null hypothesis of no relationship with the shape of family trees. Importantly, they are also, in most cases, not compatible with neutral drift models of constant-rate change across the family tree branches. Our results suggest that language diversification is driven by further adaptive and non-adaptive pressures. Language diversity cannot be understood without modelling the pressures that physical, ecological and social factors exert on language users in different environments across the globe. Present-day linguistic diversity is non-randomly distributed across the globe, forming patterns at multiple levels. For example, more than 7,000 languages are currently spoken, and these can be classified into a few hundred language families 1. Each family contains (ideally) all, and only, descendants of a single ancestral protolanguage. Given that languages evolve through time in a manner similar to the evolution of biological species (through splits, extinctions and horizontal exchange), a language family can be approximated by a structured family tree (or phylogeny) that comprises a set of languages spoken by actual human groups occupying geographical space. An intriguing observation is that not only individual languages are non-randomly distributed across the globe; language families are too: some families are huge, spanning vast areas, while others are much more circumscribed.
It has been proposed that this patterning reflects ancestral historical events and processes, such as demographic migrations and spreads, or language shift through elite dominance 14. Additionally, there is an emerging view that language diversification cannot be fully understood except in the wider context of physical, cultural and biological variation 15-17. A fundamental question, then, is why and how do language family trees unfold? Is linguistic diversification a self-contained process, or do pressures related to geographic and demographic dimensions drive diversification and shape language family trees? The classic view holds that explanations of diversity have to be sought 'first on the basis of recognized processes of internal change' 18. Here, 'internal' changes are either seen as a 'rather directionless pursuit of individual forms down the branches of the family tree' 19 or as regular phenomena such as sound change and analogy 19. Internal changes are often associated with the term 'linguistic drift' 20, which