An efficient implementation of a part-of-speech tagger for Swedish is described. The stochastic tagger uses a well-established Markov model of the language. The tagger tags 92 per cent of unknown words correctly and up to 97 per cent of all words. Several implementation and optimization considerations are discussed. The main contribution of this paper is the thorough description of the tagging algorithm and the addition of a number of improvements. The paper contains enough detail for readers to construct a tagger for their own language.

816 J. CARLBERGER AND V. KANN

grammar checking. The applications require the tagger to be both efficient (to tag quickly, especially important in information retrieval) and accurate (to tag correctly, especially important in translation). In some applications, it is not even enough to have the text syntactically disambiguated; word sense disambiguation is also needed, and that is an even harder problem [1].

Part-of-speech taggers can be constructed in various ways, and different types of taggers have different advantages. Taggers can be based on stochastic models [2-7], on rules [8,9], or on neural networks [10]. In a recent paper, Samuelsson and Voutilainen claim that rule-based taggers can give higher tagging accuracy than plain stochastic taggers on correct texts [11]. However, hybrids between rule-based taggers and stochastic taggers might be even better [12].

Several stochastic models for tagging unknown words exist [2,4]. A good survey of automatic stochastic part-of-speech tagging is given by Charniak [13].

In this paper, we describe an implementation of a part-of-speech tagger for Swedish. We wanted the tagger to be easy to implement, fast, language independent, and tag set independent, and to give high tagging accuracy. We also wanted the tagger to be able to cope with unknown words and grammatically erroneous sentences. This ability is needed in various applications, such as grammar and spell checking.

Given these requirements, we chose to construct a stochastic tagger based on a Markov model. Our goal was to achieve 95 per cent tagging accuracy for known words and 70 per cent accuracy for unknown words, and we reached and surpassed this goal.

We use the tagger in a grammar checking program for Swedish, named GRANSKA, but we designed it to be as language independent as possible, and we think that it can be used for most inflectional languages, for any tag set, and in any application needing part-of-speech tagging. As it turned out, when incorporated into GRANSKA, our tagger actually became a hybrid between a stochastic tagger and a rule-based tagger. For certain complicated cases where the stochastic tagger could be wrong, we use rules to find the correct tagging.
THE TAGGING MODEL
Markov model
In this section, we briefly describe the Markov model that is used as a stochastic model of the language. A complete and excellent description of the equations used in the standard Markov model for part-of-speech tagging can be found in Charniak et al. [2].
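As a rough illustration of what a Markov-model tagger computes, the following minimal sketch finds the most probable tag sequence with the Viterbi algorithm. It is not the implementation described in this paper. It assumes a first-order (bigram) model, in which the score of a tagging is the product of tag transition probabilities P(t_i | t_{i-1}) and lexical probabilities P(w_i | t_i). The function name `viterbi`, the three-tag tag set, the toy probability tables and the `floor` constant are all hypothetical; in particular, the constant floor is only a stand-in for proper smoothing and unknown-word handling.

```python
import math

# Toy model; all numbers are invented for illustration only.
TAGS = ["DET", "NOUN", "VERB"]
TRANS = {  # P(tag | previous tag); "<s>" marks the sentence start
    ("<s>", "DET"): 0.8, ("<s>", "NOUN"): 0.1, ("<s>", "VERB"): 0.1,
    ("DET", "NOUN"): 0.9, ("DET", "VERB"): 0.05, ("DET", "DET"): 0.05,
    ("NOUN", "VERB"): 0.7, ("NOUN", "NOUN"): 0.2, ("NOUN", "DET"): 0.1,
    ("VERB", "DET"): 0.5, ("VERB", "NOUN"): 0.3, ("VERB", "VERB"): 0.2,
}
EMIT = {  # P(word | tag)
    ("DET", "the"): 0.9,
    ("NOUN", "can"): 0.3, ("VERB", "can"): 0.6,
    ("VERB", "rusts"): 0.4,
}


def viterbi(words, tags, trans, emit, start="<s>", floor=1e-12):
    """Return the most probable tag sequence under a bigram Markov model."""
    def logp(table, key):
        # Missing entries get a tiny floor probability (assumption, not smoothing).
        return math.log(table.get(key, floor))

    # Each column maps tag -> (best log-probability ending in tag, previous tag).
    columns = [{t: (logp(trans, (start, t)) + logp(emit, (t, words[0])), None)
                for t in tags}]
    for word in words[1:]:
        prev = columns[-1]
        column = {}
        for t in tags:
            # Best predecessor for tag t at this position.
            score, back = max((prev[p][0] + logp(trans, (p, t)), p) for p in tags)
            column[t] = (score + logp(emit, (t, word)), back)
        columns.append(column)

    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: columns[-1][t][0])
    path = [tag]
    for column in reversed(columns[1:]):
        tag = column[tag][1]
        path.append(tag)
    path.reverse()
    return path


print(viterbi(["the", "can", "rusts"], TAGS, TRANS, EMIT))
```

Working in log-probabilities avoids numerical underflow on long sentences. The running time is linear in the sentence length and quadratic in the number of tags.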