“…Studies have considered information access and retrieval (Traub et al., 2018), authorship attribution (Franzini et al., 2018), named entity recognition (Hamdi et al., 2019), and topic modelling (Nelson, 2020; Mutuvi et al., 2018). Recently, Hill and Hengchen (2019) compared different tasks on an English corpus: topic modelling, collocation analysis, authorship attribution, and vector space modelling. From this study, a critical OCR quality threshold between 70% and 80% emerged: most tasks perform very poorly below this range, good results are achieved above it, and results within it vary according to the task at hand.…”