We present a proof-of-concept study that sketches the use of compression algorithms to assess Kolmogorov complexity, which is a text-based, quantitative, holistic, and global measure of structural surface redundancy. Kolmogorov complexity has been used to explore cross-linguistic complexity variation in linguistic typology research, but we are the first to apply it to naturalistic second language acquisition (SLA) data. We specifically investigate the relationship between the complexity of second language (L2) English essays and the amount of instruction the essay writers have received. Analysis shows that increased L2 instructional exposure predicts increased overall complexity and increased morphological complexity, but decreased syntactic complexity (defined here as less rigid word order). While the relationship between L2 instructional exposure and complexity is robust across a number of first language (L1) backgrounds, L1 background does predict overall complexity levels.
This article utilises an innovative, information-theoretic metric to assess complexity variation across written and spoken registers of British English. This is novel because previous research on language complexity mainly analysed complexity variation in typological data, single-language case studies or geographical varieties of the same language. The measure boils down to Kolmogorov complexity, which can be conveniently approximated with off-the-shelf compression programs. Essentially, text samples that can be compressed more efficiently count as linguistically simple. The dataset covers a wide range of traditional written and spoken registers (e.g. broadsheet newspapers, courtroom debate or face-to-face conversation), as sampled in the British National Corpus. It turns out that Kolmogorov-based register variation coincides with register formality such that informal registers are overall and morphologically less complex than more formal registers, but more complex in regard to syntax (defined here as rigid word order). Generally, the results show that written and spoken registers vary along a continuum, and significantly trade off morphological against syntactic complexity (and vice versa). Finally, the findings support proposals to view language as a complex adaptive system and demonstrate how language adapts to the situational context of language production and functional-communicative needs of its users.
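The compression idea described above can be illustrated with a minimal sketch. It assumes gzip as the off-the-shelf compressor and a plain compressed-to-original size ratio as the score; neither choice is confirmed by the abstract, and the function name `compression_ratio` and the two sample texts are hypothetical. It covers only the overall measure, not the morphological or syntactic variants.

```python
import gzip
import random


def compression_ratio(text: str) -> float:
    """Return compressed size / original size in bytes.

    Assumption: gzip stands in for the unspecified "off-the-shelf
    compression program". A lower ratio means the text is more
    redundant, i.e. simpler in the Kolmogorov sense.
    """
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("text must be non-empty")
    return len(gzip.compress(raw, compresslevel=9)) / len(raw)


# Hypothetical sample 1: one sentence repeated many times (highly redundant).
repetitive = "the cat sat on the mat. " * 200

# Hypothetical sample 2: random characters of the same length (little redundancy).
rng = random.Random(0)  # fixed seed so the sample is reproducible
alphabet = "abcdefghijklmnopqrstuvwxyz "
varied = "".join(rng.choice(alphabet) for _ in range(len(repetitive)))

if __name__ == "__main__":
    print(f"repetitive: {compression_ratio(repetitive):.3f}")
    print(f"varied:     {compression_ratio(varied):.3f}")
```

The repetitive sample compresses far better than the random one, so it counts as simpler under this measure. Real corpus texts would fall between these two extremes, and text length would need to be controlled before comparing scores.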
This article explores measures, operationalisations and effects of rhythm and weight as two constraints on the variation between the s-genitive and the of-genitive. We base the analysis on interchangeable genitives in the news and letters sections of ARCHER (A Representative Corpus of Historical English Registers), which covers the period between 1650 and 1999. Thus, we are ultimately concerned with the applicability of two factors that have their roots in speech (rhythm: phonology; weight: online processing) to an 'unconventional', written data set with a historical dimension. As for weight, we focus on the comparison of simple single-constituent and more complex multi-constituent measurements. Our notion of rhythm centres on the ideally even distribution of stressed and unstressed syllables. We find that in our data set, both rhythm and weight show theoretically unexpected quadratic effects: rhythmically better-behaved s-genitives are not necessarily preferred over of-genitives, and short constituents exhibit odd weight effects. In conclusion, we argue that while rhythm is only a minor player in our data set, the quadratic quirks it exhibits should inspire further study. Weight, on the other hand, is a crucial factor which, however, likewise comes with measurement and modelling complications.
We evaluate corpus-based measures of linguistic complexity obtained using Universal Dependencies (UD) treebanks. We propose a method of estimating robustness of the complexity values obtained using a given measure and a given treebank. The results indicate that measures of syntactic complexity might be on average less robust than those of morphological complexity. We also estimate the validity of complexity measures by comparing the results for very similar languages and checking for unexpected differences. We show that some of those differences that arise can be diminished by using parallel treebanks and, more importantly from the practical point of view, by harmonizing the language-specific solutions in the UD annotation.