Evaluation of text-level measures of lexical dispersion: Robustness and consistency

Sönning, Lukas

doi:10.31234/osf.io/h9mvs

Cited by 2 publications

(4 citation statements)

References 13 publications


“…In corpus B, instances are more densely clustered, and there are large stretches of text where the item does not occur. In the corpus-linguistic sense, then, the dispersion of the item is higher in corpus A (see Gries 2008, 2020; Sönning 2022b). This dot marks how often the item appeared in the text.…”

Section: Dispersion: Corpus-linguistic Vs Statistical Sense (mentioning)

confidence: 99%

“…As Table 1 shows, these keyness dimensions allow us to form four linguistically meaningful classes of metrics. For reasons of space, we cannot provide details about the individual measures here, and we refer the reader to Gabrielatos (2018), Rayson & Potts (2020), Gries (2020), and Sönning (2022a, 2022b). The four-way arrangement in Table 1 offers a constructive point of departure for keyness analysis, since it requires the analyst to first consider which features of keyness to emphasize when looking for typical items in the target corpus.…”

Section: Dimensions Of Keyness (mentioning)

confidence: 99%


Count regression models for keyness analysis

Sönning

2022

Preprint

Self Cite


A wide variety of measures have been used in previous work to assess the keyness of items in a particular domain of language use. The present paper explores an approach to keyword analysis based on regression modeling. Specifically, we use a form of negative binomial regression, which offers a number of advantages compared to existing techniques for identifying typical items in a target corpus. Thus, it is responsive to the multidimensional nature of keyness and can address multiple aspects of typicalness simultaneously, using a single statistical model. Further, metrics of interest can be enriched with confidence intervals, which allows us to isolate descriptive and inferential indicators of keyness. Finally, all quantities are based on a text-level analysis, which accounts for the fact that the target and reference corpus consist of text files and adjusts uncertainty estimates accordingly. As an illustrative case study, we rely on COCA to identify key verbs in academic writing and demonstrate how negative binomial regression may be used to this end. Our checks on the coverage rate of the 95% confidence intervals indicate that this model seems to be adequate for purposes of statistical inference. Due consideration will also be given to the limitations of this procedure, and we conclude by outlining the kinds of keyness analyses for which count regression models may be a worthwhile approach. The online supplementary material for this paper provides data and R code for the implementation of keyness regression.




“…texts) of different length (cf. Gries 2020; Sönning 2022a). An overview of different dispersion measures is provided in Gries (2020) and Sönning (2022a).…”

Section: Generality: Dispersion In the Target Corpus (mentioning)

confidence: 99%

“…Gries 2020; Sönning 2022a). An overview of different dispersion measures is provided in Gries (2020) and Sönning (2022a). For an assessment of the generality of an item, we will consider the following measures: D (Juilland et al. 1970), D2 (Carroll 1970), Sadj (Rosengren 1972), DP (Gries 2008; Lijffijt & Gries 2012), DA (Wilcox 1973), and DKL (Gries 2020, 2021).…”

Section: Generality: Dispersion In the Target Corpus (mentioning)

confidence: 99%
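To make one of the measures named in the statement above concrete, here is a minimal sketch of DP (deviation of proportions, Gries 2008) in its basic, unnormalized form. DP is half the sum of absolute differences between each text's share of the item's occurrences and that text's share of the corpus. The function name `dp` and the toy counts are hypothetical illustrations and are not taken from the cited papers.

```python
from typing import Sequence


def dp(counts: Sequence[int], sizes: Sequence[int]) -> float:
    """Basic (unnormalized) deviation of proportions.

    counts[i] is the number of occurrences of the item in text i,
    sizes[i] is the length of text i in tokens.
    On this scale, 0 means a perfectly even distribution and values
    close to 1 mean the item is concentrated in a small part of the corpus.
    """
    if len(counts) != len(sizes):
        raise ValueError("counts and sizes must have the same length")
    total_count = sum(counts)
    total_size = sum(sizes)
    if total_count == 0 or total_size == 0:
        raise ValueError("item must occur at least once in a non-empty corpus")
    return 0.5 * sum(
        abs(c / total_count - s / total_size) for c, s in zip(counts, sizes)
    )


# Hypothetical toy data: four texts of equal length, 20 occurrences in total.
sizes = [1000, 1000, 1000, 1000]
corpus_a = [5, 5, 5, 5]    # occurrences spread evenly across texts
corpus_b = [20, 0, 0, 0]   # occurrences clustered in a single text

print(dp(corpus_a, sizes))
print(dp(corpus_b, sizes))
```

Note the direction of the scale: on basic DP a lower score indicates a more even spread, so the evenly distributed toy corpus A scores lower than the clustered corpus B.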

Evaluation of keyness metrics: Reliability and interpretability

Sönning

2022

Preprint

Self Cite


While keyword analysis has become an essential tool in corpus-based work, the question of how to quantify keyness has been subject to considerable methodological debate. This has given rise to a variety of computerized metrics for detecting and ranking candidate items based on the comparison of a target to a reference corpus. Building on previous work, the present paper starts out by delineating four dimensions of keyness, which distinguish between frequency- and dispersion-related perspectives and identify substantively different aspects of typicalness. Existing measures are then organized according to these dimensions and evaluated with regard to two specific criteria, their interpretability and reliability. The first of these, which has been neglected in previous work, is a critical feature if metrics are to offer informative indications of keyness. The second criterion is performance-oriented and reflects the degree to which a metric produces stable and replicable rankings across repeated studies on the same pair of text varieties. Our illustrative analysis, which deals with the identification of key verbs in academic writing, shows considerable differences among indicators with regard to these two criteria. Our findings provide further support for the superiority of the Wilcoxon rank sum test and allow us to identify, within each dimension of keyness, metrics that may be given preference in applied work in light of our criteria.

