Abstract: The SPEEDCOP (SPElling Error Detection Correction Project) project, recently completed at Chemical Abstracts Service (CAS), extracted over 50,000 misspellings from approximately 25,000,000 words of text from seven scientific and scholarly databases. The misspellings were automatically classified and the error types analyzed. The results, which were consistent across the different databases, showed that the expected incidence of misspelling is 0.2%, that 90-95% of spelling errors have only a single mistake, that su…
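The abstract's statement that misspellings were "automatically classified" by error type can be made concrete with a short sketch. The Python function below (the name, labels, and logic are our own illustration, not SPEEDCOP's published procedure) takes a misspelling paired with its intended form and labels it as a single insertion, deletion, substitution, or transposition, the categories that single-error statistics like those above typically count.

def classify_error(misspelling: str, intended: str) -> str:
    """Classify the single edit that turns `intended` into `misspelling`."""
    m, n = len(misspelling), len(intended)
    if misspelling == intended:
        return "identical"
    if m == n + 1:
        # One extra character: does deleting any single character
        # of the misspelling recover the intended word?
        if any(misspelling[:i] + misspelling[i + 1:] == intended
               for i in range(m)):
            return "insertion"
    elif m == n - 1:
        if any(intended[:i] + intended[i + 1:] == misspelling
               for i in range(n)):
            return "deletion"
    elif m == n:
        diffs = [i for i in range(n) if misspelling[i] != intended[i]]
        if len(diffs) == 1:
            return "substitution"
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and misspelling[diffs[0]] == intended[diffs[1]]
                and misspelling[diffs[1]] == intended[diffs[0]]):
            return "transposition"
    return "multiple"

print(classify_error("chemcial", "chemical"))   # transposition
print(classify_error("chemial", "chemical"))    # deletion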
“…The bulk of the errors there were shown to be single-character insertions, deletions and substitutions, in line with the findings of previous studies, the largest of which was [5]. In Table 3 and Table 4 we list comparable statistics obtained from the OCRed corpora we here work with: statistics on 5,047 mainly OCR-errors from the sgd and 3,799 from the ddd.…”
Section: OCR-Errors and Other Lexical Variation in Corpora (supporting)
Abstract. This paper proposes a non-interactive system for reducing the level of OCR-induced typographical variation in large text collections, contemporary and historical. Text-Induced Corpus Clean-up or ticcl (pronounced 'tickle') focuses on high-frequency words derived from the corpus to be cleaned and gathers all typographical variants for any particular focus word that lie within a predefined Levenshtein distance (henceforth: ld). Simple text-induced filtering techniques help to retain as many of the true positives as possible and to discard as many of the false positives as possible. ticcl has been evaluated on a contemporary OCR-ed Dutch text corpus and on a corpus of historical newspaper articles, whose OCR quality is far lower and which is in an older Dutch spelling. Representative samples of typographical variants from both corpora have allowed us not only to evaluate our system properly, but also to draw effective conclusions for adapting the adopted correction mechanism to OCR-error resolution. The performance scores obtained up to ld 2 mean that the bulk of the undesirable OCR-induced typographical variation present can be removed fully automatically.
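The abstract names the mechanism but not the code, so the following is a minimal sketch, under our own assumptions, of the variant-gathering step: for a high-frequency focus word, collect every corpus word within the predefined Levenshtein distance (ld 2, matching the threshold evaluated above). The helper names and the toy Dutch word list are illustrative only; the real ticcl pipeline adds the text-induced filters the abstract describes.

from collections import Counter

def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def gather_variants(focus: str, word_freq: Counter, max_ld: int = 2):
    """All corpus words within `max_ld` edits of the focus word."""
    return {w: f for w, f in word_freq.items()
            if w != focus and levenshtein(focus, w) <= max_ld}

corpus = "regering regering regeering rcgering regcring parlement".split()
print(gather_variants("regering", Counter(corpus)))
# {'regeering': 1, 'rcgering': 1, 'regcring': 1}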
“…Many approaches have been applied since people started to deal with this problem. Different techniques like edit distance [4], rule-based techniques [10], n-grams [20], probabilistic techniques [14], neural nets [15], similarity key techniques [16,17] and noisy channel model [18,19] have been proposed. All of these are based on the idea of calculating the similarity between the misspelled word and the words contained in a dictionary.…”
Section: Approaches of Some Spell Checkers (mentioning)
Abstract. We present a language-independent spell checker based on an enhancement of the n-gram model. The spell checker proposes correction suggestions by selecting the most promising candidates from a ranked list of correction candidates derived from n-gram statistics and lexical resources. Besides motivating and describing the developed techniques, we briefly discuss the use of the proposed approach in an application for keyword- and semantic-based search support. In addition, the proposed tool was compared with state-of-the-art spelling correction approaches; the evaluation showed that it outperforms the other methods.
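The abstract does not spell out the enhanced n-gram model, so the sketch below only illustrates one common form of n-gram candidate ranking: Dice similarity over character bigrams, used to order dictionary words against a misspelling. All names and the tiny lexicon are our assumptions, not the tool's actual API.

def bigrams(word: str) -> set:
    padded = f"#{word}#"          # pad so first/last letters form bigrams
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def dice(a: str, b: str) -> float:
    """Dice coefficient over the two words' character-bigram sets."""
    ga, gb = bigrams(a), bigrams(b)
    return 2 * len(ga & gb) / (len(ga) + len(gb))

def rank_candidates(misspelling: str, lexicon, top_n: int = 3):
    """Return the `top_n` lexicon words most similar to the misspelling."""
    return sorted(lexicon, key=lambda w: dice(misspelling, w),
                  reverse=True)[:top_n]

lexicon = ["absorption", "adsorption", "absorbance", "abrasion"]
print(rank_candidates("absorbtion", lexicon))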
“…Pollock and Zamora report on a spelling error detection project at Chemical Abstracts Service (CAS) and characterize the types of errors they found.[8] Chemical Abstracts databases are among the most searched databases in the world. CAS is usually characterized as a set of sources with considerable depth and breadth.…”
This article discusses structural, systemic, and other types of bias that arise in matching new records to large databases. The focus is on databases for bibliographic utilities, but other related database concerns are discussed as well. Problems of satisfying a “match” with sufficient flexibility and rigor in an environment of imperfect data are presented, and sources of unintentional variance are discussed.
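The article describes the matching problem rather than any algorithm, but a toy example (entirely ours) shows the tension it raises between flexibility and rigor: a derived match key that normalizes away case and punctuation absorbs harmless variance in imperfect data, yet a single substantive word change must still keep two records apart.

import re

def match_key(title: str, year: str) -> str:
    """Crude derived key: lowercased title letters plus the year."""
    return re.sub(r"[^a-z]", "", title.lower()) + "|" + year

rec_db   = {"title": "Spelling Error Detection:  A Survey", "year": "1982"}
rec_new  = {"title": "Spelling error detection - a survey", "year": "1982"}
rec_diff = {"title": "Spelling Error Correction: A Survey", "year": "1982"}

print(match_key(rec_new["title"], rec_new["year"]) ==
      match_key(rec_db["title"], rec_db["year"]))      # True: noise absorbed
print(match_key(rec_diff["title"], rec_diff["year"]) ==
      match_key(rec_db["title"], rec_db["year"]))      # False: distinct work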