A lexicon is an important resource in any kind of language processing application. Corpus-based lexica have several advantages over traditional approaches. The lexicon developed for Sinhala was based on text obtained from a corpus of 10 million words drawn from diverse genres. The words extracted from the corpus have been labeled with part-of-speech categories defined according to a novel classification proposed for Sinhala. The lexicon achieves 80% coverage over unrestricted text obtained from online sources. The lexicon has been implemented in the Lexical Markup Framework (LMF).
In this work, we describe the steps and strategies we followed in defining morpheme segmentation boundaries for Sinhala words (which we call Gold Standard Definitions). We measured the coverage of this resource against three different Sinhala corpora and obtained over 70% coverage for each corpus. We then report some findings about the Sinhala language that emerged during this development and, finally, discuss some applications of this linguistic resource.
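The abstracts above do not say how coverage was computed. As a minimal sketch, assuming coverage means the share of corpus words that also appear in the resource, it could be measured either over running tokens or over distinct word types. The function names and the placeholder data below are hypothetical and are not taken from either paper.

```python
def token_coverage(tokens, lexicon):
    """Fraction of running tokens in the corpus that appear in the lexicon."""
    if not tokens:
        return 0.0
    covered = sum(1 for token in tokens if token in lexicon)
    return covered / len(tokens)


def type_coverage(tokens, lexicon):
    """Fraction of distinct word types in the corpus that appear in the lexicon."""
    types = set(tokens)
    if not types:
        return 0.0
    return len(types & lexicon) / len(types)


# Placeholder data (assumption): single letters stand in for Sinhala word forms.
lexicon = {"a", "b", "c"}
corpus = ["a", "b", "a", "d", "c", "e", "a", "b"]

print(f"token coverage: {token_coverage(corpus, lexicon):.2f}")
print(f"type coverage:  {type_coverage(corpus, lexicon):.2f}")
```

Token coverage weights frequent words more heavily, so it is usually higher than type coverage on the same corpus. A reported figure such as "over 70%" is only comparable across corpora when the same definition is used for each.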