2006
DOI: 10.1007/11946465_25
Language Modelling for the Needs of OCR of Medical Texts

Cited by 4 publications (5 citation statements); references 5 publications.
“…To the best of our knowledge, the only existing OCR error detection and correction systems for medical texts focus on either OCR correction for historical text with adapted language models (Thompson et al, 2015) or OCR recognition of handwritten notes by doctors, which is not surprising given the absence of large OCRed text corpora in this domain. Notable work in this area was carried out by Piasecki et al (2006), who examined the construction of word-level language models to improve OCR correction of Polish handwritten medical notes. They found that the repetitive character sequences and recurrent structure of medical notes greatly aided the construction of language models, but that this positive effect is domain-specific and does not carry over to similar corpora in a different medical subdomain.…”
Section: Introduction
confidence: 99%
“…In the first step, a potential OCR error is detected using either a lookup in a domain-specific lexicon (Kissos and Dershowitz, 2016) or a unigram language model (Bassil and Alwani, 2012), and/or by consulting information from the OCR process, i.e., the confidence scores of the recognized characters. The second step, candidate generation, also heavily depends on external resources, either by generating potential candidate replacements for the erroneous words from a lexicon (Piasecki and Godlewski, 2006) or by learning a mapping of characters that were often interchanged during the OCR process and using it to generate potential candidates with string distance metrics (Kukich, 1992). Such mappings are known as 'character confusions', but they need to be learned over a training corpus of considerable size before they can become effective (Evershed and Fitch, 2014).…”
Section: Introduction
confidence: 99%
“…Due to Zipf's Law, the sparse-data problem is rarely avoided, and it creates great difficulties for improving the performance of disambiguation and OOV identification. Ontology [14], as a concept-modeling tool that can represent information systems at the semantic and knowledge level [15][16][17], has captured the attention of many researchers and plays an important role in Knowledge Engineering, Digital Libraries, Software Reuse, Information Retrieval (IR), the Semantic Web, the interoperability of heterogeneous information, and so on.…”
Section: Introduction
confidence: 99%