OCR Post Correction for Endangered Language Texts

Rijhwani, Shruti; Anastasopoulos, Antonios; Neubig, Graham

doi:10.18653/v1/2020.emnlp-main.478

Cited by 23 publications

(37 citation statements)

References 21 publications


“…Two-Pass Decoding: Two-pass decoding involves first predicting with one decoder and then re-evaluating with another decoder (Geng et al, 2018; Sainath et al, 2019; Hu et al, 2020; Rijhwani et al, 2020). The two decoders iterate on the same sequence, so there is no decomposition into sub-tasks in this method.…”

Section: Discussion and Relation To Prior Work (mentioning)

confidence: 99%

Searchable Hidden Intermediates for End-to-End Models of Decomposable Sequence Tasks

Dalmia, Yan, Raunak et al. 2021

Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Langua


End-to-end approaches for sequence tasks are becoming increasingly popular. Yet for complex sequence tasks, like speech translation, systems that cascade several models trained on sub-tasks have shown to be superior, suggesting that the compositionality of cascaded systems simplifies learning and enables sophisticated search capabilities. In this work, we present an end-to-end framework that exploits compositionality to learn searchable hidden representations at intermediate stages of a sequence model using decomposed sub-tasks. These hidden intermediates can be improved using beam search to enhance the overall performance and can also incorporate external models at intermediate stages of the network to re-score or adapt towards out-of-domain data. One instance of the proposed framework is a Multi-Decoder model for speech translation that extracts the searchable hidden intermediates from a speech recognition sub-task. The model demonstrates the aforementioned benefits and outperforms the previous state-of-the-art by around +6 and +3 BLEU on the two test sets of Fisher-CallHome and by around +3 and +4 BLEU on the English-German and English-French test sets of


“…Additionally, the improvements we achieve through semi-supervised learning are potentially orthogonal to the improvements Rijhwani et al (2020) achieve by incorporating information from translations of the target text. As future work, we plan to investigate the combination of these two approaches in an attempt to utilize all available sources of information to improve performance.…”

Section: Discussion (mentioning)

confidence: 99%

“…Even state-of-the-art OCR models are susceptible to making recognition errors (Dong and Smith, 2018). Errors are particularly frequent in the case of endangered languages because most off-the-shelf OCR tools do not directly support these languages and training a high-performance OCR system is challenging given the small amount of data that is typically available (Rijhwani et al, 2020). We use OCR post-correction to correct these errors and improve the quality of the transcription.…”

Section: OCR Post-correction (mentioning)

confidence: 99%

“…As the base post-correction model, we use the model from Rijhwani et al (2020): a sequence-to-sequence model that uses an attention-based LSTM encoder-decoder (Bahdanau et al, 2015), with adaptations for low-resource OCR post-correction. We briefly describe the method here but refer readers to the original paper for details.…”

Section: Base Model (mentioning)

confidence: 99%

“…We use the OCR post-correction dataset from Rijhwani et al (2020), which contains transcribed documents in three endangered languages: Ainu, Griko, and Yakkha. Additionally, in this paper, we create a similar dataset in the endangered language Kwak'wala.…”

Section: Datasets (mentioning)

confidence: 99%


Lexically Aware Semi-Supervised Learning for OCR Post-Correction

Rijhwani, Rosenblum, Anastasopoulos et al. 2021

Transactions of the Association for Computational Linguistics

Self Cite


Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents. Optical character recognition (OCR) can be used to produce digitized text, and previous work has demonstrated the utility of neural post-correction methods that improve the results of general-purpose OCR systems on recognition of less-well-resourced languages. However, these methods rely on manually curated post-correction data, which are relatively scarce compared to the non-annotated raw images that need to be digitized. In this paper, we present a semi-supervised learning method that makes it possible to utilize these raw images to improve performance, specifically through the use of self-training, a technique where a model is iteratively trained on its own outputs. In addition, to enforce consistency in the recognized vocabulary, we introduce a lexically aware decoding method that augments the neural post-correction model with a count-based language model constructed from the recognized texts, implemented using weighted finite-state automata (WFSA) for efficient and effective decoding. Results on four endangered languages demonstrate the utility of the proposed method, with relative error reductions of 15%–29%, where we find the combination of self-training and lexically aware decoding essential for achieving consistent improvements.
