In Nepali language, compound word formation is mostly associated with inflection, derivation, and postposition attachment. Inflection occurs due to suffixation, whereas derivation is driven by both prefixation and suffixation. The compound word generated by the rules may produce lots of out-of-vocabulary words due to limited lexical resources and numerous exceptions. Hence, the machine learning approach can help to generate valid compounds and split them into valid morphemes that can be further used as a resource for spelling suggestions, information retrieval, and machine translation. In this research, a method to generate valid compounds from the corresponding compound splits (head word and prefix/suffix/ postpositions) is suggested. A BiLSTM based deep learning approach was used to generate and split the valid compound words. Publicly available Nepali Brihat Shabdakosh data from Nepal Academy and scraped news data were used for the experimentation. The obtained results were found to be outstanding compared to the rule-based approach applied to a similar job.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.