The under-resourced Kikamba language has few language technology tools, because the more efficient and popular data-driven approaches to building them suffer from data sparseness caused by a lack of digitized corpora. To address this challenge, we have developed a computational grammar for Kikamba within the multilingual Grammatical Framework (GF) toolkit. GF follows the interlingua, rule-based approach to translation. We developed the grammar using a morphology-driven strategy: we first wrote regular expressions for morphological inflection and then developed the syntax rules. The grammar was evaluated on one hundred sentences in both English and Kikamba. The results were encouraging: a 4-gram BLEU score of 83.05% and a position-independent error rate (PER) of 10.96%. This work contributes several language technology resources for Kikamba: multilingual machine translation, a morphological analyzer, and a computational grammar that provides a platform for developing multilingual applications. It also makes it possible to generate a variety of bilingual corpora pairing Kikamba with any language currently defined in GF, which makes it easier to experiment with data-driven approaches.
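To make the PER figure concrete, the following is a minimal sketch of one common formulation of the position-independent error rate, which compares hypothesis and reference as bags of words and ignores word order. The abstract does not state which PER variant or tokenization was used, so the function name, the whitespace tokenization, and the normalization by reference length are assumptions for illustration only.

```python
from collections import Counter


def position_independent_error_rate(reference: str, hypothesis: str) -> float:
    """One common PER formulation (assumed, not taken from the paper).

    Words are matched as multisets, so word order does not matter.
    Errors are the unmatched words of the longer side, normalized by
    the reference length.
    """
    ref_tokens = reference.split()  # assumption: whitespace tokenization
    hyp_tokens = hypothesis.split()
    if not ref_tokens:
        raise ValueError("reference must contain at least one token")

    # Multiset intersection counts each shared word at most as often
    # as it occurs in both sentences.
    overlap = sum((Counter(ref_tokens) & Counter(hyp_tokens)).values())

    errors = max(len(ref_tokens), len(hyp_tokens)) - overlap
    return errors / len(ref_tokens)
```

Under this formulation, a hypothesis that contains exactly the reference words in a different order scores 0, which is why PER is usually read alongside an order-sensitive metric such as BLEU.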
The knowledge-driven economy relies on technology, which increases the demand for language tools and resources for acquiring and distributing knowledge. Such tools and resources are scarce for the under-resourced, spoken Bantu languages. To help bridge this gap, this paper develops a computational grammar for the Ekegusii language in the Grammatical Framework (GF). The grammar was developed using a bottom-up, modular approach. A machine translation experiment set up to evaluate the grammar yielded a BLEU score of 55.95% and a PER of 19.49%. This work contributes a computational grammar for an under-resourced language, providing a platform for analysis and synthesis as well as machine translation within the GF ecosystem.
Part-of-speech tagging is an important first step towards machine translation and text manipulation. Although much has been done in this regard for the Indo-European and Asiatic languages, part-of-speech tagging tools for African languages are lacking. As a result, these languages are classified as under-resourced. This paper presents a data-driven part-of-speech tagging tool for Kikamba, an under-resourced language spoken mostly in Machakos, Makueni and Kitui. The tool is built with the lazy learner Memory Based Tagger (MBT) on a corpus of approximately thirty thousand words. The corpus was collected, cleaned and formatted for MBT, and the experiment was then run. Performance is very encouraging despite the small corpus, which shows that state-of-the-art data-driven methods can be used to develop tools for under-resourced languages. We report a precision of 83%, a recall of 72% and an F-score of 75%. Accuracy is 94.65% for known words and 71.93% for unknown words, with an overall accuracy of 90.68%. This suggests that, even with a small corpus, a data-driven approach can produce tools for the under-resourced languages of Kenya.
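The split between known-word and unknown-word accuracy can be illustrated with a short sketch. It assumes that a "known" word is one that occurred in the training data and that overall accuracy is computed over all test tokens; the abstract does not spell out either definition. The function name, its arguments and the placeholder tokens in the usage example are hypothetical and are not part of MBT.

```python
def accuracy_by_vocabulary(words, gold_tags, predicted_tags, training_vocab):
    """Tagging accuracy for known words, unknown words, and all tokens.

    Assumption: a word is "known" if it appears in training_vocab.
    """
    if not (len(words) == len(gold_tags) == len(predicted_tags)):
        raise ValueError("words, gold_tags and predicted_tags must align")

    # Each bucket holds [correct, total] token counts.
    counts = {"known": [0, 0], "unknown": [0, 0]}
    for word, gold, predicted in zip(words, gold_tags, predicted_tags):
        bucket = "known" if word in training_vocab else "unknown"
        counts[bucket][0] += int(gold == predicted)
        counts[bucket][1] += 1

    def ratio(correct, total):
        # An empty bucket has no defined accuracy.
        return correct / total if total else float("nan")

    correct_all = counts["known"][0] + counts["unknown"][0]
    total_all = counts["known"][1] + counts["unknown"][1]
    return {
        "known": ratio(*counts["known"]),
        "unknown": ratio(*counts["unknown"]),
        "overall": ratio(correct_all, total_all),
    }


# Usage with placeholder tokens (not real Kikamba data).
scores = accuracy_by_vocabulary(
    words=["w1", "w2", "w3", "w4"],
    gold_tags=["N", "V", "N", "ADJ"],
    predicted_tags=["N", "V", "V", "ADJ"],
    training_vocab={"w1", "w2", "w3"},
)
```

Computed this way, overall accuracy is the token-weighted mean of the two per-bucket accuracies. If the reported figures were computed this way, they would imply that roughly 82.5% of the test tokens were known words.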