2020
DOI: 10.1007/s10032-020-00359-9

Optical character recognition with neural networks and post-correction with finite state methods

Abstract: The optical character recognition (OCR) quality of the historical part of the Finnish newspaper and journal corpus is rather low for reliable search and scientific research on the OCRed data. The estimated character error rate (CER) of the corpus, achieved with commercial software, is between 8 and 13%. There have been earlier attempts to train high-quality OCR models with open-source software, like Ocropy (https://github.com/tmbdev/ocropy) and Tesseract (https://github.com/tesseract-ocr/tesseract), but so far…

Citations: cited by 46 publications (30 citation statements)
References: 23 publications
“…Methods for correcting already recognized text, and the research on them, are also closely related to the goals of this work [13,12]. This continuum of work carried out at the University of Helsinki includes a great deal of research focused specifically on historical texts [10,11]. Earlier research has not, however, focused specifically on text recognition for dictionaries, which involves very particular challenges.…”
Section: Previous Research (unclassified)
“…Character error rate (CER) was calculated for all the experimental results. CER is the percentage of erroneous characters in the system output and is a common metric in OCR-related tasks [78]. It is the number of erroneous characters divided by the sum of correct characters and errors in the output of the system.…”
Section: Statistical Tests (mentioning)
confidence: 99%
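
The CER definition quoted above (erroneous characters divided by the sum of correct characters and errors) can be illustrated with a minimal sketch. It assumes that errors are counted as the substitutions, insertions, and deletions in a minimum-edit-distance alignment between a reference transcription and the system output. The function name `cer` and this alignment choice are illustrative assumptions, not taken from the cited papers.

```python
def cer(reference: str, output: str) -> float:
    """Character error rate in percent: errors / (correct + errors).

    Assumption: errors are the substitutions, insertions and deletions
    of a minimum-edit-distance alignment; correct characters are the
    matches in that same alignment.
    """
    # Each cell holds (errors, correct) for the best alignment of the
    # reference prefix of length i with the output prefix of length j.
    prev = [(j, 0) for j in range(len(output) + 1)]  # row 0: insertions only
    for i, ref_ch in enumerate(reference, start=1):
        curr = [(i, 0)]  # column 0: deletions only
        for j, out_ch in enumerate(output, start=1):
            hit = ref_ch == out_ch
            match_or_sub = (prev[j - 1][0] + (0 if hit else 1),
                            prev[j - 1][1] + (1 if hit else 0))
            deletion = (prev[j][0] + 1, prev[j][1])
            insertion = (curr[j - 1][0] + 1, curr[j - 1][1])
            # Prefer the fewest errors, then the most correct characters.
            curr.append(min(match_or_sub, deletion, insertion,
                            key=lambda cell: (cell[0], -cell[1])))
        prev = curr
    errors, correct = prev[-1]
    total = errors + correct
    return 100.0 * errors / total if total else 0.0


if __name__ == "__main__":
    # One substituted character out of four: 1 error, 3 correct.
    print(cer("abcd", "abxd"))  # 25.0
```

Under this reading, a character accuracy rate (CAR) such as the one reported in the statements below would be the complement, 100 minus the CER, although other papers may normalize by the reference length instead.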
“…In this paper we examine to what extent deep CNN-LSTM hybrid neural networks can improve the character accuracy rate (CAR) on 19th century Swedish newspaper text during recognition. Following Drobac and Lindén (2020) approach we trained a character model for Swedish in Calamari and achieved an average CAR of 97.43% which is a new state-of-the-art result for historical Swedish newspaper text.…”
Section: Introduction (mentioning)
confidence: 99%
“…To overcome this limitation researchers have been training language specific character recognition models for different time periods (Furrer and Volk, 2011;Breuel et al, 2013;Krishna et al, 2018;Drobac et al, 2019). There have also been attempts to improve the accuracy of the models after recognition by applying post-correction methods (Drobac and Lindén, 2020;Dannélls and Persson, 2020). In this paper we examine to what extent deep CNN-LSTM hybrid neural networks can improve the character accuracy rate (CAR) on 19th century Swedish newspaper text during recognition.…”
Section: Introduction (mentioning)
confidence: 99%