Proceedings - Natural Language Processing in a Deep Learning World 2019
DOI: 10.26615/978-954-452-056-4_051

From the Paft to the Fiiture: a Fully Automatic NMT and Word Embeddings Method for OCR Post-Correction

Abstract: Many historical corpora suffer from errors introduced by the OCR (optical character recognition) methods used in the digitization process. Correcting these errors manually is time-consuming, and most automatic approaches to date have relied on rules or supervised machine learning. We present a fully automatic, unsupervised way of extracting parallel data for training a character-based sequence-to-sequence NMT (neural machine translation) model to conduct OCR error correction.
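To make the idea concrete, the sketch below shows one way such parallel data could be extracted without supervision: a word and its OCR-corrupted variants occur in similar contexts, so out-of-lexicon embedding neighbours of an in-lexicon word, filtered by edit distance, can serve as (erroneous, corrected) training pairs. This is a minimal Python sketch under assumed settings; the Word2Vec hyperparameters, the lexicon set of known-correct word forms, and the edit-distance threshold are illustrative assumptions, not details taken from the paper.

# Illustrative sketch (not the authors' exact pipeline): harvest noisy/clean
# word pairs from an OCR'd corpus without supervision.
from gensim.models import Word2Vec

def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def extract_pairs(sentences, lexicon, topn=20, max_dist=3):
    """sentences: tokenised OCR'd corpus; lexicon: set of known-correct word forms."""
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=2, workers=4)
    pairs = []
    for word in model.wv.index_to_key:
        if word not in lexicon:          # only use correctly spelled anchor words
            continue
        for cand, _sim in model.wv.most_similar(word, topn=topn):
            # a neighbour outside the lexicon but close in edit distance is
            # likely an OCR-corrupted variant of the anchor word
            if cand not in lexicon and edit_distance(word, cand) <= max_dist:
                pairs.append((cand, word))   # (erroneous, corrected)
    return pairs

# The resulting pairs can then be split into character sequences
# ("f i i t u r e" -> "f u t u r e") and fed to any character-level
# sequence-to-sequence NMT toolkit.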


Cited by 22 publications (26 citation statements)
References 15 publications (15 reference statements)
“…Their findings suggest that post-processing is the most effective way of improving a character level NMT normalization model. The same method has been successfully applied in OCR post-correction as well (Hämäläinen and Hengchen, 2019).…”
Section: Related Work
confidence: 99%
“…However, LMs directly trained on large OCR'd corpora may still yield robust word vectors. They may even be able to position a word and its badly OCR'd variants nearby in the vector space (Hämäläinen and Hengchen, 2019). In such cases, LMs can be used to identify OCR errors and possibly provide a way to correct systematic OCR errors in a large corpus.…”
Section: Language Models
confidence: 99%
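As a minimal illustration of that observation, querying the nearest neighbours of a frequent word in a model trained directly on the raw OCR'd text typically surfaces its corrupted variants. The file name and query word below are placeholders, not artefacts from the cited work.

from gensim.models import Word2Vec

# Model assumed to have been trained on the uncorrected OCR'd corpus;
# "ocr_corpus.w2v" is a placeholder path.
model = Word2Vec.load("ocr_corpus.w2v")

# Neighbours of a common word usually include OCR-corrupted spellings of it
# (e.g. long-s / f confusions), which is what makes embedding-based error
# detection possible in the first place.
for neighbour, similarity in model.wv.most_similar("future", topn=10):
    print(f"{neighbour}\t{similarity:.3f}")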
“…Since the models span several decades, they present an interesting view of words over time, useful for researchers interested in diachronic studies such as culturomics (Michel et al., 2011), semantic change (see Tahmasebi et al. (2018) and Kutuzov et al. (2018) for overviews), historical research (van Eijnatten & Ros, 2019; Hengchen et al., 2021a; Marjanen et al., 2020), etc. They can also be fed as input to more complex neural networks tackling downstream tasks aimed at historical data, such as OCR post-correction (Hämäläinen & Hengchen, 2019; Duong et al., 2020), or more linguistics-oriented problems (Budts, 2020). Since we release the whole models and not solely the learned vectors, these models can be further trained and specialised, or used by NLP researchers to compare different space alignment procedures.…”
Section: Reuse Potential
confidence: 99%
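The remark above that the released models can be further trained and specialised corresponds, in gensim terms, to continued training on new material, which releasing the whole models (rather than only the exported vectors) makes possible. The paths and corpus below are placeholders, not files from the cited release.

from gensim.models import Word2Vec

# Placeholder path; any full gensim Word2Vec model can be updated this way.
model = Word2Vec.load("decade_1890s.w2v")

new_sentences = [["tokenised", "sentences", "from", "a", "new", "domain"]]

# Extend the vocabulary with the new material, then continue training the
# existing weights instead of starting from scratch.
model.build_vocab(new_sentences, update=True)
model.train(new_sentences, total_examples=model.corpus_count, epochs=model.epochs)
model.save("decade_1890s_specialised.w2v")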