Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) 2020
DOI: 10.18653/v1/2020.emnlp-main.478

OCR Post Correction for Endangered Language Texts

Abstract: There is little to no data available to build natural language processing models for most endangered languages. However, textual data in these languages often exists in formats that are not machine-readable, such as paper books and scanned images. In this work, we address the task of extracting text from these resources. We create a benchmark dataset of transcriptions for scanned books in three critically endangered languages and present a systematic analysis of how general-purpose OCR tools are not robust to …

Cited by 23 publications (37 citation statements)
References 21 publications
“…Two-Pass Decoding: Two-pass decoding involves first predicting with one decoder and then re-evaluating with another decoder (Geng et al., 2018; Sainath et al., 2019; Hu et al., 2020; Rijhwani et al., 2020). The two decoders iterate on the same sequence, so there is no decomposition into sub-tasks in this method.…”
Section: Discussion and Relation To Prior Work (mentioning)
confidence: 99%
“…Additionally, the improvements we achieve through semi-supervised learning are potentially orthogonal to the improvements Rijhwani et al. (2020) achieve by incorporating information from translations of the target text. As future work, we plan to investigate the combination of these two approaches in an attempt to utilize all available sources of information to improve performance.…”
Section: Discussion (mentioning)
confidence: 99%
“…Even state-of-the-art OCR models are susceptible to making recognition errors (Dong and Smith, 2018). Errors are particularly frequent in the case of endangered languages because most off-the-shelf OCR tools do not directly support these languages and training a high-performance OCR system is challenging given the small amount of data that is typically available (Rijhwani et al., 2020). We use OCR post-correction to correct these errors and improve the quality of the transcription.…”
Section: OCR Post-correction (mentioning)
confidence: 99%