An automatic spelling-correction algorithm corrects most of the 50,000 misspellings culled from 25,000,000 words of text from seven scientific and scholarly databases. It uses a similarity key to identify words in a large dictionary that are most similar to a particular misspelling, and then an error-reversal test to select from these the most plausible correction(s).
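The two-stage idea above (similarity key, then error reversal) can be sketched roughly as follows. This is an illustrative sketch only, not the published algorithm: the key function `skeleton_key`, the neighbourhood `window`, and the helper names are all assumptions made for the example. The key used here is the first letter, then the remaining distinct consonants in order of appearance, then the distinct vowels.

```python
import bisect
import string

VOWELS = set("aeiou")

def skeleton_key(word):
    """Hypothetical similarity key: first letter, then the other distinct
    consonants in order of appearance, then the distinct vowels."""
    word = word.lower()
    seen = {word[0]}
    consonants, vowels = [], []
    for ch in word[1:]:
        if ch in seen or not ch.isalpha():
            continue
        seen.add(ch)
        (vowels if ch in VOWELS else consonants).append(ch)
    return word[0] + "".join(consonants) + "".join(vowels)

def build_index(dictionary):
    """Sort the dictionary by key so that similar words end up close together."""
    return sorted((skeleton_key(w), w) for w in dictionary)

def neighbours(index, misspelling, window=3):
    """Dictionary words whose keys sort closest to the misspelling's key."""
    pos = bisect.bisect_left(index, (skeleton_key(misspelling), ""))
    lo, hi = max(0, pos - window), min(len(index), pos + window)
    return [w for _, w in index[lo:hi]]

def single_error_reversals(word):
    """All strings obtained by undoing one insertion, omission,
    substitution, or transposition in `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    undo_insertion = {a + b[1:] for a, b in splits if b}
    undo_transposition = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    undo_substitution = {a + c + b[1:] for a, b in splits if b for c in letters}
    undo_omission = {a + c + b for a, b in splits for c in letters}
    return undo_insertion | undo_transposition | undo_substitution | undo_omission

def correct(index, misspelling, window=3):
    """Keep only the key neighbours that one error reversal can reach."""
    reversals = single_error_reversals(misspelling.lower())
    return [w for w in neighbours(index, misspelling, window) if w in reversals]

# Toy dictionary, invented for the example.
index = build_index(["chemical", "chemist", "channel", "abstract", "service"])
print(correct(index, "chemcial"))   # transposition
print(correct(index, "servvice"))   # doubled letter
```

Sorting by key and scanning a small window, rather than requiring an exact key match, lets a misspelling that drops a vowel (and so has a slightly different key) still land next to its correction.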
SPEEDCOP (SPElling Error Detection/COrrection Project), recently completed at Chemical Abstracts Service (CAS), extracted over 50,000 misspellings from approximately 25,000,000 words of text from seven scientific and scholarly databases. The misspellings were automatically classified and the error types analyzed. The results, which were consistent over the different databases, showed that the expected incidence of misspelling is 0.2%, that 90-95% of spelling errors have only a single mistake, that substitution is homogeneous while transposition is heterogeneous, that omission is the commonest type of misspelling, and that inadvertent doubling of a letter is the most important cause of insertion errors. The more frequently a letter occurs in the text, the more likely it is to be involved in a spelling error. Most misspellings collected by SPEEDCOP are of the type colloquially referred to as "typos" and approximately 90% are unlikely to be repeated in normal spans of text.
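As a rough illustration of the automatic classification mentioned above, the sketch below labels a misspelling that differs from its correction by exactly one mistake as an insertion, omission, substitution, or transposition. The function name `classify_single_error` and the word pairs are invented for the example; they are not SPEEDCOP data, and the project's actual classifier may have worked differently.

```python
from collections import Counter

def classify_single_error(correct, typed):
    """Return the single-error type that turns `correct` into `typed`,
    or None if the strings are identical or differ by more than one error."""
    if correct == typed:
        return None
    # Skip the common prefix; the first mismatch is where the error starts.
    i = 0
    while i < min(len(correct), len(typed)) and correct[i] == typed[i]:
        i += 1
    c_rest, t_rest = correct[i:], typed[i:]
    if len(typed) == len(correct) + 1 and t_rest[1:] == c_rest:
        return "insertion"
    if len(typed) + 1 == len(correct) and c_rest[1:] == t_rest:
        return "omission"
    if len(typed) == len(correct):
        if c_rest[1:] == t_rest[1:]:
            return "substitution"
        if (len(c_rest) >= 2 and c_rest[0] == t_rest[1]
                and c_rest[1] == t_rest[0] and c_rest[2:] == t_rest[2:]):
            return "transposition"
    return None

# Invented (correct, typed) pairs, for illustration only.
pairs = [
    ("chemical", "chemcial"),
    ("service", "servvice"),
    ("abstract", "abstrct"),
    ("database", "databese"),
    ("spelling", "speling"),
]
tally = Counter(classify_single_error(c, t) for c, t in pairs)
print(tally)
```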
INTRODUCTION
Not only does the problem of correcting spelling errors by computer have a long history, but it is also evidently of considerable current interest, as papers 17,95 and letters 18,30,57,66,69 on the topic continue to appear rapidly. This is not surprising, since techniques useful in detecting and correcting mis-spellings normally have other important applications. Moreover, both the power of small computers and the routine production of machine-readable text have increased enormously over the last decade, to the point where automatic spelling error detection/correction has become not only feasible but highly desirable.
Potential applications for spelling error detection/correction techniques arise in numerous areas. Early papers focused on the correction of output from optical character recognition (OCR), voice recognition, or Morse code, or on spelling errors in program code, but the domain of most interest today is probably the correction of machine-readable text made available by word processing. However, methods for assessing the similarity of two strings of symbols, which are widely used to compare mis-spellings with dictionary words, are of very general interest; e.g., for determining the evolutionary distance of proteins. 56,70,72 Similarly, one can imagine spelling correction techniques being extended to almost any kind of error-prone transmission, even to partially decrypted code. Also, spelling error detection involves searching large dictionaries, and this capability is obviously of widespread utility.
This note attempts to provide a comprehensive bibliography of papers in English on the major aspects of spelling error detection and correction of English text. The author is solely responsible for the content of the annotations.
SPELLING ERROR DETECTION
The goal of spelling error detection is basically to decide if a text string is a valid word; this is normally done by determining whether or not the string is in a system dictionary.
As both the dictionary and the number of words to be processed are usually large in real-world systems, it is important to make the dictionary search highly efficient. Note that words need not be literally present in the dictionary; they may be stored much more economically as, for example, hash codes, patterns of bits distributed over a long string, or n-grams. However, in compressed representations, one usually has to be content with a very high probability that a given word is present or not rather than with the certainty given by a literal dictionary. Similarly, the dictionary may be searched via tries, trees, hash coding (scatter storage) or a variety of other techniques.
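One way to picture "patterns of bits distributed over a long string" is a Bloom-filter-style dictionary, sketched below. This is an illustration under assumptions rather than any system described in the cited papers: the class name `BloomDictionary`, the bit-array size, the number of hash functions, and the use of SHA-256 are all choices made for the example. A stored word is always reported as present; an unknown string is reported as absent only with very high probability, which is the trade-off the paragraph describes.

```python
import hashlib

class BloomDictionary:
    """Compressed dictionary: each word sets a few bits in a long bit string."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, word):
        # Derive several bit positions per word by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        # True if every bit for this word is set; false positives are possible.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(word))

def detect_errors(words, dictionary):
    """Flag every string that the dictionary does not recognise."""
    return [w for w in words if w.lower() not in dictionary]

# Toy vocabulary, invented for the example.
lexicon = BloomDictionary()
for w in ["spelling", "error", "detection", "is", "useful"]:
    lexicon.add(w)

print(detect_errors("Spelling eror detection is useful".split(), lexicon))
```

The words themselves are never stored, only the bit string, so memory use is fixed by `num_bits` regardless of word length; enlarging it lowers the false-positive rate.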