Historical text presents numerous challenges for contemporary different techniques, e.g. information retrieval, OCR and POS tagging. In particular, the absence of consistent orthographic conventions in historical text presents difficulties for any system which requires reference to a fixed lexicon accessed by orthographic form. For example, language modeling or retrieval engine for historical text which is produced by OCR systems, where the spelling of words often differ in various way, e.g. one word might have different spellings evolved over time. It is very important to aid those techniques with the rules for automatic mapping of historical wordforms. In this paper, we propose a new technique to model the target modern language by means of a recurrent neural network with long-short term memory architecture. Because the network is recurrent, the considered context is not limited to a fixed size especially due to memory cells which are designed to deal with long-term dependencies. In the set of experiments conducted on the Luther bible database and transform wordforms from Early New High German (ENHG) 14 th -16 th centuries to the corresponding modern wordforms in New High German (NHG). We compare our proposed supervised model LSTM to various methods for computing word alignments using statistical, heuristic models. Our new proposed LSTM outperforms the other three state-of-the-art methods. The evaluation shows the accuracy of our model on the known wordforms is 93.90% and on the unknown wordforms is 87.95%, while the accuracy of the existing state-of-the-art combined approach of the wordlist-based and rule-based normalization models is 92.93% for known and 76.88% for unknown tokens. Our proposed LSTM model outperforms on normalizing the modern wordform to historical wordform. The performance on seen tokens is 93.4%, while for unknown tokens is 89.17%.
The contribution of this paper is a new strategy of integrating multiple recognition outputs of diverse recognizers. Such an integration can give higher performance and more accurate outputs than a single recognition system. The problem of aligning various Optical Character Recognition (OCR) results lies in the difficulties to find the correspondence on character, word, line, and page level. These difficulties arise from segmentation and recognition errors which are produced by the OCRs. Therefore, alignment techniques are required for synchronizing the outputs in order to compare them. Most existing approaches fail when the same error occurs in the multiple OCRs. If the corrections do not appear in one of the OCR approaches are unable to improve the results. We design a Line-to-Page alignment with edit rules using Weighted Finite-State Transducers (WFST). These edit rules are based on edit operations: insertion, deletion, and substitution. Therefore, an approach is designed using Recurrent Neural Networks with Long Short-Term Memory (LSTM) to predict these types of errors. A Character-Epsilon alignment is designed to normalize the size of the strings for the LSTM alignment. The LSTM returns best voting, especially when the heuristic approaches are unable to vote among various OCR engines. LSTM predicts the correct characters, even if the OCR could not produce the characters in the outputs. The approaches are evaluated on OCR's output from the UWIII and historical German Fraktur dataset which are obtained from state-of-the-art OCR systems. The experiments shows that the error rate of the LSTM approach has the best performance with around 0.40%, while other approaches are between 1.26% and 2.31%.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.