Proceedings of the Seventh International Workshop on Health Text Mining and Information Analysis 2016
DOI: 10.18653/v1/w16-6108

Low-resource OCR error detection and correction in French Clinical Texts

Abstract: In this paper we present a simple yet effective approach to automatic OCR error detection and correction on a corpus of French clinical reports of variable OCR quality within the domain of foetopathology. While traditional OCR error detection and correction systems rely heavily on external information such as domain-specific lexicons, OCR process information or manually corrected training material, these are not always available given the constraints placed on using medical corpora. We therefore propose a nove…


Cited by 10 publications (6 citation statements); references 14 publications.
“…A character-level bidirectional LSTM language model is developed and applied to post-process digitised French clinical texts by D'hondt et al [39,40]. Given clean texts (i.e., digital collections containing no errors), they automatically create the training material by randomly applying edit operations (i.e., deletion, insertion, substitution).…”
Section: Context-dependent Approaches
confidence: 99%
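The statement above describes building training material from clean text by randomly applying character-level edit operations. A minimal sketch of that idea in Python — the function name, error rate, and choice of replacement alphabet are illustrative assumptions, not details taken from the paper:

```python
import random
import string

def corrupt(text, error_rate=0.05, seed=0):
    """Introduce OCR-like noise into clean text by randomly applying
    edit operations (deletion, insertion, substitution) per character."""
    rng = random.Random(seed)  # seeded for reproducible training data
    out = []
    for ch in text:
        if rng.random() < error_rate:
            op = rng.choice(["delete", "insert", "substitute"])
            if op == "delete":
                continue  # drop the character entirely
            elif op == "insert":
                out.append(ch)
                out.append(rng.choice(string.ascii_lowercase))  # spurious char
            else:
                out.append(rng.choice(string.ascii_lowercase))  # wrong char
        else:
            out.append(ch)
    return "".join(out)

clean = "rapport clinique de foetopathologie"
noisy = corrupt(clean, error_rate=0.1, seed=42)
```

Pairs of (noisy, clean) strings produced this way can then serve as input/target training examples without any manually corrected material.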
“…Some techniques have been proposed to generate artificial training material, such as randomly deleting, inserting, and substituting characters in a given word [39][40][41]; mimicking realistic errors from repetition texts by picking an alternative from a list of frequently confused characters for a given character [57]; or using a reversed error model whose input is GT word n-grams [52].…”
Section: Suggested Guidelines
confidence: 99%
“…In the general domain, there are several well-developed tools that can detect and correct possible OCR errors automatically. For domain-specific text, however, popular existing tools based on simple rules cannot handle OCR errors well, especially where complex and difficult terminology is used (D'hondt et al., 2016; Thompson et al., 2015). Faced with this problem, it may be necessary to build an OCR correction tool adapted specifically for this domain and genre (Zhang et al., 2019).…”
Section: Discussion
confidence: 99%
“…Various investigations have been carried out on automatic detection and correction in clinical documentation and, specifically, in clinical reports. Most of these studies have been developed with corpora in English (e.g., Fivez et al., 2017; Workman et al., 2019), but these issues have also been explored for French (D'Hondt et al., 2016), Russian (Balabaeva et al., 2020), Swedish (Dziadek et al., 2017), Dutch (Fivez et al., 2017), Hungarian (Siklósi et al., 2016), and Persian (Yazdani et al., 2020). These works highlight the substantial number of linguistic errors that corpora of clinical reports usually contain, with studies reporting error rates around 5% (Lai et al., 2015) or even 10% (Ruch et al., 2003).…”
Section: Error Analysis and Automatic Correction in the Medical Domain
confidence: 99%