Polish is a synthetic language with a high morpheme-per-word ratio. Its high degree of inflection leads to high out-of-vocabulary (OOV) rates and high Language Model (LM) perplexities, which poses a challenge for Large Vocabulary Continuous Speech Recognition (LVCSR) systems. Here, the use of morpheme and syllable based units is investigated for building sub-lexical LMs. A different type of sub-lexical unit is proposed, based on combining morphemic or syllabic units with their corresponding pronunciations. Thereby, a set of grapheme-phoneme pairs called graphones is used for building LMs. A relative reduction of 3.5% in Word Error Rate (WER) is obtained with respect to a traditional system based on full-words.
Index Terms: language model, morpheme, syllable, graphone, Polish
INTRODUCTION
Polish is considered one of the morphologically rich languages. It belongs to the family of Slavic languages, like Russian, Czech, and Bulgarian. Polish is characterized by a high degree of inflection, having seven cases and three genders. Declensional endings depend on case, number, gender and animacy. In addition, declension differs depending on whether the word is a noun or an adjective. Moreover, word stems are frequently modified by the addition or absence of endings. This produces a huge lexical variety that causes data sparsity and leads to high OOV rates and high LM perplexities. Normally, traditional Polish LVCSR systems use a large recognition lexicon of several hundred thousand full-words [1]. However, relatively high OOV rates are still obtained. In addition, the automatic speech recognition (ASR) system suffers from high resource requirements. Therefore, sub-words are used instead of full-words in order to reduce the lexical variety. Normally, the number of possible sub-words in a corpus is smaller than that of full-words, giving a higher average frequency per unit. This helps to reduce OOV rates and limit the recognition search space.
One possible type of sub-word is the morpheme, which is the smallest linguistic component of a word that has a semantic meaning. For Slavic languages, morpheme based LMs are proposed in [2,3]. They are based on decomposing words into stems and endings. Moreover, morpheme based LMs are used for other languages such as German [4] and Arabic [5].
Another type of sub-word is the syllable, which is considered a phonological building block of words. A syllable is usually made up of a nuclear vowel with optional initial and final consonants [6]. Syllable based LMs are successfully used for languages like Chinese [7]. In [8], a syllable based LM is proposed for Polish, where both the OOV rate and the LM perplexity are reduced, but no WERs are provided.
A different approach is to combine the graphemic sub-words with their corresponding pronunciations. 
This allows different context-dependent pronunciations of sub-words to be captured at the level of the LM rather than at the lexicon level. In [9], a set of automatically derived morphemes joined with pronunciations augments a normal word model and is used for an English LVCSR task. In [10, 4] a set of c...