Mind your corpus: systematic errors in authorship attribution

Eder, Maciej

doi:10.1093/llc/fqt039

Cited by 34 publications

(19 citation statements)

References 18 publications

(9 reference statements)

“…9b). This result confirmed the robustness of stylometric methods with (slightly) noisy texts, as already suggested by Eder (2012) and Franzini et al (2018). * Ten texts were excluded from the stylometric analysis: nine because they were too short (under 500 words) and one because it had already been attributed to Musil on the basis of philological proof (cf.…”

Section: Conclusion and Future Perspectives

supporting

confidence: 86%

Textual Cultures 12.2

Werner¹

2019

Textual Cultures

“…9b). This result confirmed the robustness of stylometric methods with (slightly) noisy texts, as already suggested by Eder (2012) and Franzini et al (2018). * Ten texts were excluded from the stylometric analysis: nine because they were too short (under 500 words) and one because it had already been attributed to Musil on the basis of philological proof (cf.…”

Section: Conclusion and Future Perspectives

supporting

confidence: 73%

“…A closer analysis of the noisiest texts confirmed that these peaks issue primarily from errors in image segmentation: in many cases, the correct reading order was not respected, or text regions from different articles were incorrectly intermixed. Apart from these errors, however, the situation appeared quite promising, with a mean character error rate of 2-3%, which is generally considered as a high standard in OCR quality (Fink, Schulz, and Springmann 2017) and which may not influence significantly a stylometric analysis (Eder 2012). For these reasons, instead of proceeding with a manual transcription of the TSZ articles, I decided simply to re-apply the OCR process, while improving the quality of the process as much as possible.…”

Section: The Klagenfurter Ausgabe and OCR

mentioning

confidence: 99%

A Digital Edition between Stylometry and OCR:

Rebora

2019

Textual Cultures

This article presents the digital edition of Robert Musil’s work (Klagenfurter Ausgabe) and its role in a digital humanities project aimed at reconstructing Musil’s activity in the WWI journal Tiroler Soldaten-Zeitung. First, the article reviews the ways in which the computational methods of stylometry are applied to attribute the anonymous texts published in the Klagenfurter Ausgabe. Second, it explores how optical character recognition (OCR) software is employed to expand the corpus. At the core of this methodology two machine learning algorithms are trained and revised using the transcriptions of the Klagenfurter Ausgabe, to reach an accuracy of about 99.9% in the digitization of the Tiroler Soldaten-Zeitung texts. The work of this project offers not only the possibility of expanding stylometric analysis to the whole journal, but also of improving the transcriptions of the Klagenfurter Ausgabe.

“…While OCR errors remain part of a wider problem of dealing with "noise" in text mining [23], their impact varies depending on the task performed [24]. NLP tasks such as machine translation, sentence boundary detection, tokenization, and part-of-speech tagging on text among others can all be compromised by OCR errors [25].…”

Section: OCR Errors and Topic Modeling

mentioning

confidence: 99%

Evaluating the Impact of OCR Errors on Topic Modeling

Mutuvi

Doucet

Odeo

et al. 2018

Lecture Notes in Computer Science

Historical documents pose a challenge for character recognition for various reasons, such as font disparities across different materials, a lack of orthographic standards (so that the same words are spelled differently), material quality, and the unavailability of lexicons of known historical spelling variants. As a result, optical character recognition (OCR) of those documents often yields unsatisfactory accuracy, rendering digital material only partially discoverable and the data it holds difficult to process. In this paper, we explore the impact of OCR errors on the identification of topics from a corpus comprising text from historical OCRed documents. Based on experiments performed on OCR text corpora, we observe that OCR noise negatively impacts the stability and coherence of topics generated by topic modeling algorithms and we quantify the strength of this impact.
