Unlimited vocabulary speech recognition for agglutinative languages

Kurimo, Mikko; Puurula, Antti; Arisoy, Ebru; Siivola, Vesa; Hirsimäki, Teemu; Pylkkönen, Janne; Alumäe, Tanel; Saraçlar, Murat

doi:10.3115/1220835.1220897

Cited by 59 publications

(47 citation statements)

References 17 publications


“…The baseline algorithm has been found to be very useful in automatic speech recognition of agglutinative languages (Kurimo et al, 2006). However, it often oversegments morphemes that are rare or not seen at all in the training data.…”

Section: Finnish-to-English Translation (mentioning)

confidence: 99%

Minimum Bayes risk combination of translation hypotheses from alternative morphological decompositions

Gispert, Virpioja, Kurimo, et al. 2009

Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Com

Self Cite


We describe a simple strategy to achieve translation performance improvements by combining output from identical statistical machine translation systems trained on alternative morphological decompositions of the source language. Combination is done by means of Minimum Bayes Risk decoding over a shared N-best list. When translating into English from two highly inflected languages such as Arabic and Finnish, we obtain significant improvements over simply selecting the best morphological decomposition.
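
The combination step described in this abstract can be illustrated with a minimal sketch of Minimum Bayes Risk selection over a pooled N-best list. Everything named here is an assumption for illustration: `token_overlap` is a toy gain function standing in for whatever sentence-level metric the authors used, `mbr_select` is a hypothetical helper, and the log scores are made-up inputs rather than values from the paper.

```python
import math
from collections import Counter


def token_overlap(hyp, ref):
    """Toy gain function (assumption): Dice overlap of token multisets."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    total = sum(h.values()) + sum(r.values())
    if total == 0:
        return 1.0
    return 2.0 * sum((h & r).values()) / total


def mbr_select(nbest):
    """Pick the hypothesis with the highest expected gain.

    nbest: list of (hypothesis, log_score) pairs pooled from all systems,
    i.e. the shared N-best list.
    """
    # Turn log scores into a posterior distribution (softmax),
    # subtracting the maximum for numerical stability.
    max_log = max(score for _, score in nbest)
    weights = [math.exp(score - max_log) for _, score in nbest]
    z = sum(weights)
    posteriors = [w / z for w in weights]

    best_hyp, best_gain = None, float("-inf")
    for hyp, _ in nbest:
        # Expected gain of hyp against every hypothesis in the list.
        expected = sum(
            p * token_overlap(hyp, other)
            for (other, _), p in zip(nbest, posteriors)
        )
        if expected > best_gain:
            best_hyp, best_gain = hyp, expected
    return best_hyp


# Hypothetical pooled list: two systems agree on most tokens,
# a third hypothesis has the best score but no support.
pooled = [("a b c", 0.0), ("a b d", 0.0), ("x y z", 0.1)]
choice = mbr_select(pooled)
```

In this toy input the top-scored hypothesis shares no tokens with the others, so the selection falls on one of the two mutually supporting hypotheses, which is the behaviour that distinguishes MBR combination from simply taking the single best output.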


“…Segmentation of words, clitics, and affixes is essential for a number of natural language processing (NLP) applications, including machine translation, parsing, and speech recognition (Chang et al, 2008; Tsarfaty, 2006; Kurimo et al, 2006). Segmentation is a common practice in Arabic NLP due to the language's morphological richness.…”

Section: Introduction (mentioning)

confidence: 99%

Word Segmentation of Informal Arabic with Domain Adaptation

Monroe, Green, Manning 2014

Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)


Segmentation of clitics has been shown to improve accuracy on a variety of Arabic NLP tasks. However, state-of-the-art Arabic word segmenters are either limited to formal Modern Standard Arabic, performing poorly on Arabic text featuring dialectal vocabulary and grammar, or rely on linguistic knowledge that is hand-tuned for each dialect. We extend an existing MSA segmenter with a simple domain adaptation technique and new features in order to segment informal and dialectal Arabic text. Experiments show that our system outperforms existing systems on broadcast news and Egyptian dialect, improving segmentation F1 score on a recently released Egyptian Arabic corpus to 92.09%, compared to 91.60% for another segmenter designed specifically for Egyptian Arabic.


“…The Morfessor Baseline model has been a popular method for segmenting Finnish, Estonian and other agglutinative languages for speech recognition [11,12]. In this work, we use the Morfessor 2.0 implementation [13].…”

Section: Morfessor (mentioning)

confidence: 99%

Improved Subword Modeling for WFST-Based Speech Recognition

Smit, Virpioja, Kurimo 2017

Interspeech 2017

Self Cite


Because in agglutinative languages the number of observed word forms is very high, subword units are often utilized in speech recognition. However, the proper use of subword units requires careful consideration of details such as silence modeling, position-dependent phones, and combination of the units. In this paper, we implement subword modeling in the Kaldi toolkit by creating a modified lexicon with finite-state transducers to represent the subword units correctly. We experiment with multiple types of word boundary markers and achieve the best results by adding a marker to the left or right side of a subword unit whenever it is not preceded or followed by a word boundary, respectively. We also compare three different toolkits that provide data-driven subword segmentations. In our experiments on a variety of Finnish and Estonian datasets, the best subword models do outperform word-based models and naive subword implementations. The largest relative reduction in WER is 23% over word-based models for a Finnish read speech dataset. The results are also better than any previously published ones for the same datasets, and the improvement on all datasets is more than 5%.
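
The boundary-marker scheme this abstract reports as best (a marker on whichever side of a subword unit does not touch a word boundary) can be sketched as follows. This is only an illustration under stated assumptions: the `+` marker character, the function names `mark_subwords` and `join_subwords`, and the example segmentation are hypothetical and are not taken from the paper or from Kaldi.

```python
def mark_subwords(segmented_words, marker="+"):
    """Attach a marker to each side of a subword that is word-internal.

    segmented_words: list of words, each given as a list of subword strings.
    Returns a flat list of marked subword units.
    """
    marked = []
    for pieces in segmented_words:
        last = len(pieces) - 1
        for i, piece in enumerate(pieces):
            left = marker if i > 0 else ""       # not preceded by a word boundary
            right = marker if i < last else ""   # not followed by a word boundary
            marked.append(left + piece + right)
    return marked


def join_subwords(units, marker="+"):
    """Rebuild words from marked units (assumes subwords contain no marker)."""
    words, current = [], ""
    for unit in units:
        continues = unit.endswith(marker)
        # Drop at most one marker from each side.
        if unit.startswith(marker):
            unit = unit[len(marker):]
        if continues:
            unit = unit[:-len(marker)]
        current += unit
        if not continues:
            words.append(current)
            current = ""
    if current:  # malformed input that ends mid-word
        words.append(current)
    return words


# Hypothetical segmentation of two Finnish words, for illustration only.
units = mark_subwords([["talo", "ssa", "ni"], ["on"]])
words = join_subwords(units)
```

Because every unit carries its own boundary information on both sides, a recognizer's output sequence can be turned back into words without a separate boundary token, which is the practical appeal of this style of marking.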
