2017
DOI: 10.12685/027.7-5-1-169

Transfer Learning for OCRopus Model Training on Early Printed Books

Abstract: A method is presented that significantly reduces the character error rates for OCR text obtained from OCRopus models trained on early printed books when only small amounts of diplomatic transcriptions are available. This is achieved by building from already existing models during training instead of starting from scratch. To overcome the discrepancies between the character set of the pretrained model and the additional ground truth, the OCRopus code is adapted to allow for alphabet expansion or reduction. T…
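The central adaptation described in the abstract, re-mapping a pretrained model's output layer to an expanded or reduced alphabet before training continues, can be illustrated with a minimal sketch. The function and variable names below are purely illustrative assumptions, not the actual OCRopus code:

import numpy as np

def remap_output_layer(W_out, b_out, old_alphabet, new_alphabet, scale=0.01):
    """Map a pretrained output layer onto a new alphabet.

    Rows for characters shared by both alphabets are copied from the
    pretrained model (transfer), characters that only occur in the new
    ground truth get small random rows (expansion), and characters absent
    from the new alphabet are simply dropped (reduction).
    """
    hidden_size = W_out.shape[1]
    W_new = np.random.randn(len(new_alphabet), hidden_size) * scale
    b_new = np.zeros(len(new_alphabet))
    old_index = {c: i for i, c in enumerate(old_alphabet)}
    for j, c in enumerate(new_alphabet):
        if c in old_index:
            W_new[j] = W_out[old_index[c]]   # reuse learned weights
            b_new[j] = b_out[old_index[c]]
    return W_new, b_new

Under such a scheme the recurrent hidden layer of the pretrained model would be left untouched, so its learned low-level features carry over while only the output layer is adjusted to the new character set before training resumes.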

Cited by 7 publications (9 citation statements)
References 4 publications
“…By dividing the GT into N different folds and aligning them in a certain way, it is possible to train N strong but also diverse models which act as voters in a newly created confidence voting scheme. Second, the so-called pretraining functionality makes it possible to build from an already available Calamari model instead of starting training from scratch, which not only speeds up the training process considerably but also improves the recognition accuracy [11]. Third, data augmentation uses the routines of ocrodeg to generate noisy variations of the training material.…”
Section: Calamari (mentioning)
confidence: 99%
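A confidence voting scheme of the kind described above can be sketched as follows. This is a simplified illustration that assumes the outputs of the N fold models are already aligned position by position; it is not Calamari's actual implementation:

from collections import defaultdict

def confidence_vote(predictions):
    """Combine aligned predictions from several models.

    `predictions` is a list (one entry per model) of lists of
    (character, confidence) pairs, all aligned to the same positions.
    Returns the voted string.
    """
    voted = []
    for position in zip(*predictions):          # iterate over aligned positions
        scores = defaultdict(float)
        for char, conf in position:
            scores[char] += conf                # sum confidences per candidate
        voted.append(max(scores, key=scores.get))
    return "".join(voted)

# Example: three models voting on two aligned positions
models = [
    [("e", 0.9), ("ſ", 0.6)],
    [("e", 0.8), ("f", 0.7)],
    [("c", 0.5), ("ſ", 0.9)],
]
print(confidence_vote(models))  # -> "eſ"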
“…On the methodological side, several improvements have been made by the introduction of voting ensembles, trained with a single OCR engine, whose results are suitably combined [10], and by a pretraining approach which allows the use of existing models instead of training from scratch [11].…”
Section: Introduction (mentioning)
confidence: 99%
“…Obviously, these examples of transfer learning used far deeper networks than OCRopus, which has only a single hidden layer, resulting in a dramatically increased number of parameters and consequently more opportunities to learn and remember useful low-level features. Nonetheless, since scripts in general should show a high degree of similarity, we still expected a noteworthy impact of pretraining and studied the effect of building from an already available mixed model instead of starting training from scratch (see Reul et al., 2017d). As starting points we used the models for modern English, German Fraktur from the 19th century, and the Latin Antiqua model described above.…”
Section: Transfer Learning and OCR Pretraining (mentioning)
confidence: 99%
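The impact of such starting points is typically reported as a character error rate (CER), i.e. the edit distance between recognized text and ground truth divided by the ground truth length. A minimal sketch of that comparison follows; the model names and predictions are purely hypothetical:

def levenshtein(a, b):
    """Plain edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(predicted, ground_truth):
    """Character error rate: edit distance normalised by GT length."""
    return levenshtein(predicted, ground_truth) / max(len(ground_truth), 1)

# Hypothetical comparison of models trained from different starting points
for name, prediction in [("from_scratch", "Jn dem anfang"),
                         ("from_fraktur", "In dem anfang")]:
    print(name, cer(prediction, "In dem anfang"))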
“…In order to avoid blind spots, especially when dealing with small amounts of GT and with important but less frequent characters such as numbers or capital letters, it is possible to define a so-called whitelist. Characters on the whitelist will not be deleted from the matrix, no matter whether they occur within the GT or not (Reul et al., 2017d).…”
Section: Pretraining Utilizing Transfer Learning (mentioning)
confidence: 99%
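A whitelist-aware construction of the training alphabet might look like the following sketch; the function name and arguments are illustrative assumptions, not the published implementation:

def build_target_alphabet(gt_lines, pretrained_alphabet, whitelist=""):
    """Derive the alphabet for continued training.

    Characters seen in the ground truth are always kept; characters of the
    pretrained model that are absent from the GT are dropped (alphabet
    reduction) unless they are protected by the whitelist, e.g. digits or
    capital letters that are rare in small GT samples.
    """
    gt_chars = set("".join(gt_lines))
    protected = set(whitelist) & set(pretrained_alphabet)
    return sorted(gt_chars | protected)

# Example: keep digits even though the small GT sample contains none
alphabet = build_target_alphabet(
    gt_lines=["In dem anfang was das wort"],
    pretrained_alphabet="abcdefghijklmnopqrstuvwxyz0123456789",
    whitelist="0123456789",
)
print(alphabet)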