2013
DOI: 10.1093/llc/fqt039

Mind your corpus: systematic errors in authorship attribution

Abstract: In computational stylistics, any influence of unwanted noise (e.g. caused by an untidily prepared corpus) might lead to biased or false results. Relying on contaminated data is quite similar to using dirty test tubes in a laboratory: it inescapably means falling into systematic error. An important question is what degree of nonchalance is acceptable to obtain sufficiently reliable results. The present study attempts to verify the impact of unwanted noise in a series of experiments conducted on several corpora of …
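To make the experimental setup the abstract describes concrete, here is a minimal Python sketch, not the paper's actual pipeline: inject character-level noise into a disputed text and check whether a simple most-frequent-word profile still attributes it to the right author. The toy corpus, the noise model, and the crude Delta-style distance below are all illustrative assumptions.

    import random
    import string
    from collections import Counter

    def add_noise(text, rate, rng):
        """Replace a fraction `rate` of characters with random letters."""
        chars = list(text)
        for i in range(len(chars)):
            if rng.random() < rate:
                chars[i] = rng.choice(string.ascii_lowercase)
        return "".join(chars)

    def word_freqs(text, vocab):
        """Relative frequencies of the `vocab` words in `text`."""
        tokens = text.lower().split()
        total = max(len(tokens), 1)
        counts = Counter(tokens)
        return [counts[w] / total for w in vocab]

    def delta(a, b):
        """Crude Burrows-style distance: mean absolute difference."""
        return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

    # Toy corpus (an assumption): two invented "authors" plus a disputed text.
    corpus = {
        "author_A": "the cat sat on the mat and the cat slept " * 50,
        "author_B": "a dog ran in a park and a dog barked loudly " * 50,
    }
    disputed = "the cat sat on the mat while the cat purred " * 50

    rng = random.Random(42)
    vocab = ["the", "a", "and", "on", "in", "cat", "dog"]

    for rate in (0.0, 0.05, 0.20):  # increasing levels of corruption
        target = word_freqs(add_noise(disputed, rate, rng), vocab)
        scores = {name: delta(target, word_freqs(text, vocab))
                  for name, text in corpus.items()}
        print(f"noise={rate:.2f} -> attributed to {min(scores, key=scores.get)}")

At low noise rates the attribution typically survives, while heavier corruption flattens the frequency profile and the distances converge, which is the kind of degradation the study set out to quantify.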

Cited by 34 publications (19 citation statements)
References 18 publications (9 reference statements)

“…9b). This result confirmed the robustness of stylometric methods with (slightly) noisy texts, as already suggested by Eder (2012) and Franzini et al (2018). * Ten texts were excluded from the stylometric analysis: nine because they were too short (under 500 words) and one because it had already been attributed to Musil on the basis of philological proof (cf.…”
Section: Conclusion and Future Perspectives (supporting)
confidence: 73%
“…A closer analysis of the noisiest texts confirmed that these peaks issue primarily from errors in image segmentation: in many cases, the correct reading order was not respected, or text regions from different articles were incorrectly intermixed. Apart from these errors, however, the situation appeared quite promising, with a mean character error rate of 2-3%, which is generally considered as a high standard in OCR quality (Fink, Schulz, and Springmann 2017) and which may not influence significantly a stylometric analysis (Eder 2012). For these reasons, instead of proceeding with a manual transcription of the TSZ articles, I decided simply to re-apply the OCR process, while improving the quality of the process as much as possible.…”
Section: The Klagenfurter Ausgabe and OCR (mentioning)
confidence: 99%
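For context on the 2-3% figure above: a character error rate (CER) of this kind is conventionally computed as the Levenshtein edit distance between the OCR output and a ground-truth transcription, divided by the length of the ground truth. A minimal sketch, with invented sample strings:

    def levenshtein(a, b):
        """Minimum number of single-character edits turning `a` into `b`."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution
            prev = curr
        return prev[-1]

    def cer(ocr_text, ground_truth):
        """Character error rate: edits needed, relative to the true length."""
        return levenshtein(ocr_text, ground_truth) / max(len(ground_truth), 1)

    truth = "systematic errors in authorship attribution"
    ocr = "systernatic errars in authorship atribution"  # invented OCR output
    print(f"CER = {cer(ocr, truth):.1%}")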
“…While OCR errors remain part of a wider problem of dealing with "noise" in text mining [23], their impact varies depending on the task performed [24]. NLP tasks such as machine translation, sentence boundary detection, tokenization, and part-of-speech tagging on text among others can all be compromised by OCR errors [25].…”
Section: OCR Errors and Topic Modeling (mentioning)
confidence: 99%