2014
DOI: 10.1093/llc/fqu064

Significance testing of word frequencies in corpora

Abstract: Finding out whether a word occurs significantly more often in one text or corpus than in another is an important question in analysing corpora. As noted by Kilgarriff (2005), the use of the χ² and log-likelihood ratio tests is problematic in this context, as they are based on the assumption that all samples are statistically independent of each other. However, words within a text are not independent. As pointed out in Kilgarriff (2001) and Paquot & Bestgen (2009), it is possible to represent the data differen…

Cited by 116 publications (60 citation statements)
References 45 publications
“…However, these tests, according to the way they are used to analyse lexical differences between corpora, are inadequate, as has already been pointed out by several authors, and should no longer be used (Bestgen, 2012, 2014; Brezina & Meyerhoff, 2014; Kilgarriff, 1996, 2005; Lijffijt, Nevalainen, Säily, Papapetrou, Puolamäki & Mannila, 2016). The aim of this paper is to help researchers to abandon them by explaining in detail the problem they pose and its origin, by showing why several possible solutions are ineffective and by recommending two valid and efficient statistical tests.…”
Section: British American British American
Citation type: mentioning
confidence: 99%
“…There are several approaches to define the 'keyness' of words in a corpus of text [13]. Here, we use Rayson's (2008) approach to calculate the log-likelihood ratio between the frequency of a word from one dataset compared to the frequency of that word in a reference dataset.…”
Section: Analysis Of Open Comment Feedback
Citation type: mentioning
confidence: 99%
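
The log-likelihood keyness calculation mentioned in the statement above can be sketched as follows. This is a minimal illustration, assuming the commonly used two-term form of the statistic (observed versus expected frequency in each of the two datasets); the function name `log_likelihood` and the example counts are hypothetical and do not come from the citing paper.

```python
import math


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood ratio (G2) for one word in two corpora.

    freq_a, freq_b: occurrences of the word in corpus A and corpus B.
    size_a, size_b: total number of tokens in corpus A and corpus B.
    Assumption: two-term form, summing only over the word's own counts.
    """
    combined = freq_a + freq_b
    total = size_a + size_b
    # Expected counts if the word were equally frequent in both corpora.
    expected_a = size_a * combined / total
    expected_b = size_b * combined / total
    g2 = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


if __name__ == "__main__":
    # Hypothetical counts: 30 hits in 1,000 tokens versus 10 hits in 1,000 tokens.
    print(log_likelihood(30, 1000, 10, 1000))
```

Note that this statistic treats every token as an independent observation, which is the assumption the abstract above describes as problematic for word frequencies.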
“…To achieve this, we use the t-test as suggested by Paquot and Bestgen (2009) and Lijffijt et al (2014). One of the benefits of the t-test is that it takes variation within the corpora into account.…”
Section: Distinctiveness and Collocational Strength: N-gram Ranking
Citation type: mentioning
confidence: 99%
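
How a t-test takes within-corpus variation into account can be illustrated by comparing per-text relative frequencies rather than pooled counts. The sketch below is an assumption-laden example: it uses Welch's unequal-variance variant (the quoted statement says only "t-test"), the function name `welch_t` and the sample values are hypothetical, and it returns the test statistic and degrees of freedom without a p-value.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and degrees of freedom for two samples.

    Each sample holds one relative frequency per text, so the spread of a
    word across texts enters the test through the sample variances.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    # Per-sample contribution to the squared standard error of the difference.
    term_a = variance(sample_a) / n_a
    term_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(term_a + term_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (term_a + term_b) ** 2 / (
        term_a ** 2 / (n_a - 1) + term_b ** 2 / (n_b - 1)
    )
    return t, df


if __name__ == "__main__":
    # Hypothetical per-text frequencies (occurrences per 1,000 words).
    corpus_a = [1.0, 2.0, 3.0, 4.0, 5.0]
    corpus_b = [2.0, 3.0, 4.0, 5.0, 6.0]
    print(welch_t(corpus_a, corpus_b))
```

A word concentrated in a single text produces a large variance and therefore a small statistic, whereas a pooled-count test would not register that concentration.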