Natural language data (sociolinguistic, historical, and other types of corpora) should not be analyzed with fixed-effects regression models, such as those used by VARBRUL and GoldVarb. This is because tokens of linguistic variables are rarely independent; they are usually grouped and correlated according to factors like speaker (or text) and word. Fixed-effects models can estimate the effects of higher-level "nesting" predictors (like speaker gender or word frequency), but they cannot be accurate if there exist any individual effects of lower-level "nested" predictors (like speaker or word). Mixed-effects models are designed to take these multiple levels of variation into account at the same time. Because many predictors of interest are in a nesting relationship with speaker or word, mixed models give more accurate quantitative estimates of their effect sizes, and especially of their statistical significance. The problems with fixed-effects models are only exacerbated by the token imbalances that exist across speakers and words in naturalistic speech; mixed-effects models handle these imbalances well. This article demonstrates these and other advantages of mixed models, using data on /t, d/-deletion taken from the Buckeye Corpus as well as other real and simulated data sets.
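To make the grouping logic concrete, here is a minimal sketch (not the article's own code) of fitting a model with a by-speaker random intercept, using Python's statsmodels on simulated data. The data set, effect sizes, and variable names are invented for the example; real /t, d/-deletion data are binary and would call for a mixed-effects logistic regression, but the way the random intercept absorbs speaker-level variation is the same.

```python
# Sketch: by-speaker random intercept on simulated data (hypothetical values).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n_speakers, tokens_per_speaker = 20, 30
speaker = np.repeat(np.arange(n_speakers), tokens_per_speaker)
# gender is a speaker-level ("nesting") predictor: constant within speaker
gender = np.repeat(rng.integers(0, 2, n_speakers), tokens_per_speaker)
# each speaker also has an idiosyncratic ("nested") effect of their own
speaker_effect = np.repeat(rng.normal(0.0, 1.0, n_speakers), tokens_per_speaker)
y = 0.5 * gender + speaker_effect + rng.normal(0.0, 1.0, speaker.size)
df = pd.DataFrame({"y": y, "gender": gender, "speaker": speaker})

# The random intercept per speaker soaks up individual variation, so the
# estimate (and significance) of the speaker-level gender effect is not
# overstated the way it would be in a fixed-effects-only model.
result = smf.mixedlm("y ~ gender", df, groups=df["speaker"]).fit()
print(result.summary())
```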
While keyword analysis has become an essential tool in corpus-based work, the question of how to quantify keyness has been subject to considerable methodological debate. This has given rise to a variety of computerized metrics for detecting and ranking candidate items based on the comparison of a target to a reference corpus. Building on previous work, the present paper starts out by delineating four dimensions of keyness, which distinguish between frequency- and dispersion-related perspectives and identify substantively different aspects of typicalness. Existing measures are then organized according to these dimensions and evaluated with regard to two specific criteria, their interpretability and reliability. The first of these, which has been neglected in previous work, is a critical feature if metrics are to offer informative indications of keyness. The second criterion is performance-oriented and reflects the degree to which a metric produces stable and replicable rankings across repeated studies on the same pair of text varieties. Our illustrative analysis, which deals with the identification of key verbs in academic writing, shows considerable differences among indicators with regard to these two criteria. Our findings provide further support for the superiority of the Wilcoxon rank sum test and allow us to identify, within each dimension of keyness, metrics that may be given preference in applied work in light of our criteria.
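As an illustration of the text-level comparison behind the metric the abstract singles out, the sketch below applies scipy's mannwhitneyu (the Mann-Whitney U test, equivalent to the Wilcoxon rank sum test) to per-text relative frequencies of one candidate item in a target versus a reference corpus. The frequencies, sample sizes, and distributions are simulated assumptions, not the paper's data.

```python
# Sketch: text-level Wilcoxon rank sum (Mann-Whitney U) keyness test
# on simulated per-text relative frequencies of one candidate verb.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Relative frequencies (per 1,000 words) in each text of the two corpora
target_freqs = rng.gamma(shape=2.0, scale=1.5, size=60)     # e.g. academic texts
reference_freqs = rng.gamma(shape=2.0, scale=0.8, size=60)  # e.g. general texts

# One-sided test: is the item used more heavily across the target texts?
# Ranking candidates by this statistic (or its p-value) yields a key list.
stat, p = mannwhitneyu(target_freqs, reference_freqs, alternative="greater")
print(f"U = {stat:.1f}, p = {p:.4g}")
```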
The traditional approach to measuring lexical dispersion is to form corpus parts of equal size and then compare the occurrence rate of an item across these units. In recent methodological work, this strategy has met with criticism because it ignores corpus structure. Dispersion, it is argued, should be measured across linguistically meaningful units such as the individual text files constituting the corpus. Though desirable on linguistic grounds, a shift to texts as the unit of analysis raises new methodological issues. While the ability of dispersion measures to handle unevenly sized corpus units has received attention in the literature, the question of how existing metrics perform in these novel settings has only been partly addressed. This paper aims to shed light on relevant statistical properties of a wide range of text-level dispersion measures. Specifically, we consider the robustness of different indicators, i.e. whether they are (overly) sensitive to data situations that can arise when texts differ (considerably) in length. We use hypothetical data scenarios to identify weak spots in existing measures, and then propose modifications to DP- and DA-related indexes to build in useful statistical properties and produce more resistant estimators. Along with the other measures, these are then evaluated against actual corpus data drawn from the BNC. We observe that adapted DP- and DA-variants perform at least as well as their original versions. Our permutation-based simulation study also demonstrates that Carroll's D2 shows the same weakness as Juilland's D, i.e. a noticeable sensitivity to the number of units that enter the analysis.
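For concreteness, here is a hedged sketch of one of the measures under discussion, Gries's DP, computed over texts of unequal length. The formula follows its standard published definition; the text sizes and frequencies below are hypothetical, and the sketch is not the paper's own implementation or its proposed modifications.

```python
# Sketch: Gries's DP over texts of unequal length (hypothetical data).
import numpy as np

def dp(freqs, text_sizes):
    """DP = half the summed absolute difference between each text's share
    of the item's tokens and that text's share of the corpus as a whole.
    0 = dispersion perfectly proportional to text sizes; values near 1 =
    the item is concentrated in a small portion of the corpus."""
    freqs = np.asarray(freqs, dtype=float)
    sizes = np.asarray(text_sizes, dtype=float)
    expected = sizes / sizes.sum()   # each text's share of the corpus
    observed = freqs / freqs.sum()   # each text's share of the item
    return 0.5 * np.abs(observed - expected).sum()

sizes = [1000, 1000, 1000, 5000]          # four texts, one much longer
print(dp([40, 0, 0, 0], sizes))           # 0.875: concentrated in one short text
print(dp([5, 5, 5, 25], sizes))           # 0.0: spread in proportion to size
```

Because the expected shares are derived from the actual text sizes, the measure accommodates unevenly sized units directly, which is exactly the setting whose statistical pitfalls the paper examines.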