2011
DOI: 10.1080/09296174.2011.533588

Finding the Minimum Document Length for Reliable Clustering of Multi-Document Natural Language Corpora

Cited by 3 publications (6 citation statements)
References 6 publications
“…(34) and the difficulty of working on short texts (35), length criteria deemed necessary to obtain reliable results in authorship attribution can vary, some authors seeming to achieve good results with texts under 1000 words (36, 37), while recent systematic studies seem to advocate the study of more substantial texts (38).…”
Section: Methods (mentioning)
confidence: 99%
“…To increase the reliability of the analyses, in a corpus with texts of varying length, we decided to select features based on the confidence level and margin of error that we could attain even for the smallest available sample in our corpus. The minimum sample size n was calculated using the following formula (38): $n = \bar{p}(1 - \bar{p})\left(\frac{z}{e}\right)^{2}$, where $\bar{p}$ is the feature mean probability in our corpus, used as an estimate of the population probability π, z is the confidence level, and e is the margin of error of the probability estimate. We set z to obtain a confidence level above 90% and e = 2σ, where σ is the feature standard deviation in the corpus.…”
Section: Methods (mentioning)
confidence: 99%
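
The minimum-sample-size formula quoted above can be illustrated with a short Python sketch. This is only an illustration, not the citing authors' implementation: the function name `min_sample_size`, the use of a two-sided normal quantile for z, and the example values are all assumptions made here.

```python
from statistics import NormalDist


def min_sample_size(p_bar, sigma, confidence=0.90):
    """Evaluate n = p_bar * (1 - p_bar) * (z / e)**2 with e = 2 * sigma.

    p_bar      -- mean probability of the feature across the corpus
    sigma      -- standard deviation of that probability across the corpus
    confidence -- desired confidence level (assumed two-sided here)
    """
    # Assumption: z is the two-sided standard normal quantile,
    # e.g. about 1.645 for a 90% confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Margin of error as stated in the quoted passage: e = 2 * sigma.
    e = 2 * sigma
    return p_bar * (1 - p_bar) * (z / e) ** 2


# Hypothetical feature: mean probability 0.01, standard deviation 0.002.
n = min_sample_size(0.01, 0.002)
print(round(n))
```

Under these assumptions, a feature is reliable for a given text only if the text contains at least n tokens, so rarer or less variable features demand longer texts.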
“…6). Texts that are too short create a problem of reliability, as the observed frequencies may not accurately represent the actual probability of a given variable's appearance (Moisl, 2011). To limit this issue, we removed texts below 1,000 words, a relatively low limit when compared to existing benchmarks (Eder, 2015, 2017), but motivated by the necessity to not exclude too many texts.…”
Section: Unsupervised Analysis Of Short Anonymous Texts (mentioning)
confidence: 99%
“…Given the short length of the texts and the sparsity caused by noise, we implement a procedure to select for analysis only those features that satisfy a criterion of statistical reliability. In this, we follow the procedure suggested by Moisl (2011), in the implementation already used by Cafiero and Camps (2019). To summarize it, features are only retained if they match the desired confidence level and margin of error even for the smallest text in the corpus.…”
Section: Unsupervised Analysis Of Short Anonymous Texts (mentioning)
confidence: 99%
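
The retention rule summarized in the statement above (keep a feature only if the smallest text is long enough for it) might be sketched as follows. This is a hedged illustration rather than the implementation of Moisl (2011) or of Cafiero and Camps (2019): the function `reliable_features`, its input layout, the population standard deviation, the two-sided z value, and the margin of error e = 2σ (borrowed from the formula quoted earlier on this page) are assumptions.

```python
from statistics import NormalDist, mean, pstdev


def reliable_features(counts, lengths, confidence=0.90):
    """Return features whose required sample size fits the shortest text.

    counts  -- one dict per text mapping feature -> raw count (hypothetical layout)
    lengths -- token length of each text, in the same order as counts
    """
    # Assumption: two-sided standard normal quantile for the confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    shortest = min(lengths)
    kept = []
    for feature in sorted(set().union(*counts)):
        # Per-text relative frequency of the feature.
        probs = [c.get(feature, 0) / n for c, n in zip(counts, lengths)]
        p_bar, sigma = mean(probs), pstdev(probs)
        if sigma == 0:
            continue  # e = 2 * sigma would be zero, so n is undefined
        n_min = p_bar * (1 - p_bar) * (z / (2 * sigma)) ** 2
        if n_min <= shortest:
            kept.append(feature)
    return kept


# Two hypothetical texts of 1000 and 2000 tokens.
counts = [{"the": 100, "rare": 1}, {"the": 240, "rare": 3}]
print(reliable_features(counts, [1000, 2000]))
```

In this toy corpus the frequent feature needs roughly 660 tokens and is retained, while the rare one needs far more than the 1000 tokens available in the shortest text and is dropped.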