Detection of spelling errors in Swedish not using a word list En Clair

Domeij, Rickard; Hollman, Joachim; Kann, Viggo

doi:10.1080/09296179408590017

Cited by 15 publications

(9 citation statements)

References 10 publications


“…If a successful decomposition of an unknown compound word can be made, we have strong evidence of the correct possible tags for that word, since the last word form in a compound determines its part-of-speech. In STAVA [18][19][20] an algorithm for decomposing compounds into their word form parts was implemented.…”

Section: Analyzing Compound Words (mentioning)

confidence: 99%
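The idea in the statement above, that the last word form of a compound determines its part of speech, can be illustrated with a minimal sketch. This is not the STAVA algorithm: the toy lexicon and the names `LEXICON`, `split_compound`, and `guess_tags` are assumptions made for illustration only, and linking elements between parts are ignored.

```python
from typing import Dict, List, Optional, Set

# Hypothetical toy lexicon: known word form -> possible part-of-speech tags.
LEXICON: Dict[str, Set[str]] = {
    "fot": {"NOUN"},
    "boll": {"NOUN"},
    "blod": {"NOUN"},
    "röd": {"ADJ"},
}


def split_compound(word: str, lexicon: Dict[str, Set[str]] = LEXICON) -> Optional[List[str]]:
    """Return one decomposition of `word` into known word forms, or None."""
    if word in lexicon:
        return [word]
    # Try every split point; the head must be known, the tail is split recursively.
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]
        if head in lexicon:
            rest = split_compound(tail, lexicon)
            if rest is not None:
                return [head] + rest
    return None


def guess_tags(word: str, lexicon: Dict[str, Set[str]] = LEXICON) -> Set[str]:
    """Guess tags for an unknown compound from its last word form."""
    parts = split_compound(word, lexicon)
    if parts is None:
        return set()  # no decomposition found, so no evidence
    return set(lexicon[parts[-1]])
```

With this toy lexicon, `guess_tags("blodröd")` yields the adjective tag because the last part is `röd`, while a word that cannot be decomposed yields an empty set.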

Implementing an efficient part-of-speech tagger

Carlberger

Kann

1999

Softw: Pract. Exper.

Self Cite


An efficient implementation of a part-of-speech tagger for Swedish is described. The stochastic tagger uses a well-established Markov model of the language. The tagger tags 92 per cent of unknown words correctly and up to 97 per cent of all words. Several implementation and optimization considerations are discussed. The main contribution of this paper is the thorough description of the tagging algorithm and the addition of a number of improvements. The paper contains enough detail for the reader to construct a tagger for his own language.

[...] grammar checking. The applications require the tagger to be both efficient (to tag quickly, especially important in information retrieval), and accurate (to tag correctly, especially important in translation). In some applications, it is not even enough to have the text syntactically disambiguated; a word sense disambiguation is needed, and that is an even harder problem [1].

Part-of-speech taggers can be constructed in various ways, and different types of taggers have different advantages. Taggers can be based on stochastic models [2-7], on rules [8,9], or on neural networks [10]. In a recent paper, Samuelsson and Voutilainen claim that rule-based taggers can give higher tagging accuracy than plain stochastic taggers on correct texts [11]. However, hybrids between rule-based taggers and stochastic taggers might be even better [12].

Some different stochastic models for tagging unknown words exist [2,4]. A good survey of automatic stochastic part-of-speech tagging is Charniak [13].

In this paper, we describe an implementation of a part-of-speech tagger for Swedish. We wanted the tagger to be easy to implement, fast, language independent, tag set independent, and that it should give high accuracy of tagging. We also wanted the tagger to be able to cope with unknown words and grammatically erroneous sentences. This ability is needed in various applications, such as grammar and spell checking.

Given these requirements, we chose to construct a stochastic tagger based on a Markov model. Our goal was to achieve 95 per cent tagging accuracy for known words and 70 per cent accuracy for unknown words, and we both reached and surpassed the goal.

We use the tagger in a grammar checking program for Swedish, named GRANSKA, but we designed it to be as language independent as possible, and we think that it can be used for most inflectional languages, for any tag set, and in any application needing part-of-speech tagging. As it turned out, when incorporated into GRANSKA, our tagger actually became a hybrid between a stochastic tagger and a rule-based tagger. For certain complicated cases where the stochastic tagger could be wrong, we use rules to find the correct tagging.

THE TAGGING MODEL

Markov model

In this section, we briefly describe the Markov model that is used as a stochastic model of the language. A complete and excellent description of the equations used in the standard Markov model for part-of-speech tagging can be found in Charniak et al. [2].
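As a rough illustration of Markov-model tagging in general, the following sketch runs Viterbi decoding over a first-order (bigram) model. It is not the tagger described in the abstract: the model order, the probability floor for unseen events, and all names and toy probabilities (`viterbi`, `START_P`, `TRANS_P`, `EMIT_P`) are assumptions made for this example.

```python
import math
from typing import Dict, List, Optional, Tuple

TAGS = ["DET", "NOUN", "VERB"]

# Hypothetical toy probabilities, invented for this example.
START_P: Dict[str, float] = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS_P: Dict[Tuple[str, str], float] = {
    ("DET", "NOUN"): 0.9,
    ("NOUN", "VERB"): 0.8,
    ("NOUN", "NOUN"): 0.1,
    ("VERB", "DET"): 0.5,
}
EMIT_P: Dict[Tuple[str, str], float] = {
    ("en", "DET"): 0.5,
    ("hund", "NOUN"): 0.1,
    ("springer", "VERB"): 0.1,
}


def viterbi(words: List[str], tags=TAGS, start_p=START_P, trans_p=TRANS_P,
            emit_p=EMIT_P, floor: float = 1e-6) -> List[str]:
    """Most probable tag sequence under a first-order Markov model."""
    if not words:
        return []

    def lp(table, key) -> float:
        # Unseen events get a small floor probability (an assumed smoothing).
        return math.log(table.get(key, floor))

    # best[i][t] = (log-probability of the best path ending in t, previous tag)
    best: List[Dict[str, Tuple[float, Optional[str]]]] = [
        {t: (lp(start_p, t) + lp(emit_p, (words[0], t)), None) for t in tags}
    ]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            score, prev = max(
                (best[i - 1][p][0] + lp(trans_p, (p, t)), p) for p in tags
            )
            column[t] = (score + lp(emit_p, (words[i], t)), prev)
        best.append(column)

    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]
```

Working in log space avoids numeric underflow on long sentences, and the floor stands in for whatever treatment of unknown words a real tagger would use.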


“…Accounting for and listing all the possible words is not feasible for purposes of error correction. Domeij proposed a method to build a spell checker that utilizes stem lists and orthographic rules, which govern how a word is written, and morphotactic rules, which govern how morphemes (building blocks of meanings) are allowed to combine, to accept legal combinations of stems (Domeij et al 1994). By breaking up compound words, dictionary lookup can be applied to individual constituent stems.…”

Section: OCR Error Correction (mentioning)

confidence: 99%
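The approach summarized in the statement above, accepting a word when it can be broken into listed stems that combine legally, can be sketched as follows. This is only an illustration under stated assumptions, not the method of Domeij et al.: the stem list `STEMS`, the single morphotactic rule (an optional linking "s" between stems, in `LINKS`), and the function name `is_accepted` are all hypothetical.

```python
from functools import lru_cache
from typing import FrozenSet, Tuple

# Hypothetical stem list; a real checker would use a much larger one.
STEMS: FrozenSet[str] = frozenset({"fot", "boll", "spelare", "hus", "båt"})

# Assumed morphotactic rule: stems may be joined directly or by a linking "s".
LINKS: Tuple[str, ...] = ("", "s")


def is_accepted(word: str, stems: FrozenSet[str] = STEMS) -> bool:
    """True if `word` reads as stem (link stem)* using the lists above."""

    @lru_cache(maxsize=None)
    def ok(start: int) -> bool:
        # Can word[start:] be parsed as a stem followed by more (link, stem) pairs?
        for end in range(start + 1, len(word) + 1):
            if word[start:end] not in stems:
                continue
            if end == len(word):
                return True
            for link in LINKS:
                nxt = end + len(link)
                # A link must be followed by at least one more stem.
                if word.startswith(link, end) and nxt < len(word) and ok(nxt):
                    return True
        return False

    return ok(0)
```

Memoizing on the start position keeps the search polynomial in the word length even when many stems overlap.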

Effect of OCR error correction on Arabic retrieval

Magdy

Darwish

2008

Inf Retrieval


Arabic documents that are available only in print continue to be ubiquitous and they can be scanned and subsequently OCR'ed to ease their retrieval. This paper explores the effect of context-based OCR correction on the effectiveness of retrieving Arabic OCR documents using different index terms. Different OCR correction techniques based on language modeling with different correction abilities were tested on real OCR and synthetic OCR degradation. Results show that the reduction of word error rates needs to pass a certain limit to get a noticeable effect on retrieval. If only moderate error reduction is available, then using short character n-gram for retrieval without error correction is not a bad strategy. Word-based correction in conjunction with language modeling had a statistically significant impact on retrieval even for character 3-grams, which are known to be among the best index terms for OCR degraded Arabic text. Further, using a sufficiently large language model for correction can minimize the need for morphologically sensitive error correction.


“…There are two main approaches to error correction, namely, word level and passage level. Some of the kinds of word-level postprocessing include the use of dictionary lookup [Brill and Moore 2000; Church and Gale 1991; Hong 1995; Jurafsky and Martin 2000a], character [Lu et al 1999; Taghva et al 1994] and word n-gram frequency analysis [Hong 1995; Magdy and Darwish 2006b], and morphological analysis [Domeij et al 1994; Oflazer 1996]. Passage-level postprocessing techniques include the use of word n-grams [Magdy and Darwish 2006b], word collocations [Hong 1995], grammar [Agirre et al 1998] (which is challenging due to the current poorness of Arabic parsing [Moussa et al 2003]), conceptual closeness [Hong 1995], passage-level word clustering [Taghva et al 1994] (which requires handling of affixes for Arabic [De Roeck and Al-Fares 2000]), and linguistic and visual context [Hong 1995].…”

Section: OCR Error Correction (mentioning)

confidence: 99%

Error correction vs. query garbling for Arabic OCR document retrieval

Darwish

Magdy

2007

ACM Trans. Inf. Syst.


Due to the existence of large numbers of legacy documents (such as old books and newspapers), improving retrieval effectiveness for OCR'ed documents continues to be an important problem. This article compares the effect of OCR error correction with and without language modeling and the effect of query garbling with weighted structured queries on the retrieval of OCR degraded Arabic documents. The results suggest that moderate error correction does not yield statistically significant improvement in retrieval effectiveness when indexing and searching using n-grams. Also, reversing error correction models to perform query garbling in conjunction with weighted structured queries yields improved retrieval effectiveness. Lastly, using very good error correction that utilizes language modeling yields the best improvement in retrieval effectiveness.

