Applying supervised machine learning to a Hindi corpus for the classification and prediction of verses is a largely unexplored and useful area. Classification and prediction benefit many applications, such as organizing a large corpus and information retrieval. The metalinguistic facility provided by websites makes Hindi a major language in the digital domain of information technology today. Text classification algorithms, together with Natural Language Processing (NLP), facilitate fast, cost-effective, and scalable solutions. Performance evaluation of these predictors is a challenging task. Classification of text data is important for reducing the manual effort and time spent reading documents. In this paper, 697 Hindi poems are classified into four topics using four eager machine-learning algorithms. In the absence of any other technique that performs prediction on a Hindi corpus, misclassification error is used and compared to demonstrate the improvement offered by the technique. The support vector machine performs best among the four.
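As a rough illustration of the evaluation measure named above, misclassification error can be computed as the fraction of poems whose predicted topic differs from the true topic. The sketch below is not the paper's code: the topic names, the classifier names, and the label lists are hypothetical placeholders.

```python
def misclassification_error(y_true, y_pred):
    """Return the fraction of predictions that differ from the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


# Hypothetical topic labels for six poems (not data from the paper).
y_true = ["nature", "love", "devotion", "valour", "love", "nature"]

# Hypothetical predictions from two classifiers.
predictions = {
    "classifier_a": ["nature", "love", "devotion", "valour", "love", "love"],
    "classifier_b": ["nature", "valour", "love", "valour", "love", "love"],
}

# Lower error is better, so the classifier with the minimum error wins.
errors = {
    name: misclassification_error(y_true, y_pred)
    for name, y_pred in predictions.items()
}
best = min(errors, key=errors.get)
print(errors, best)
```

Comparing classifiers on this single number is only meaningful when all of them are scored on the same held-out poems.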
Text in different languages has become widely available because almost all websites now offer a multilingual option. Hindi is considered an official language in one of the states of India. Hindi text analysis is dominated by corpora of stories and poems. Token extraction is an important step before any text analysis and supports many applications, such as text summarization and text categorization. Token extraction is part of Natural Language Processing (NLP), which includes many steps such as corpus preprocessing and lemmatization. In this paper, tokens are extracted by two methods and on two corpora. BaSa, a context-based term extraction technique that combines different NLP activities such as Term Frequency-Inverse Document Frequency (TF-IDF), and Zipf's law are used to count and compare the extracted tokens. The tokens produced by the two methods are then compared. The corpus contains prose and verse in Hindi as well as Marathi. Common tokens from the verse and prose corpora of both Marathi and Hindi are identified to show that the two behave the same as far as NLP activities are concerned. BaSa is shown to perform better than Zipf's law. The Hindi corpus includes 820 stories and 710 poems, and the Marathi corpus includes 610 stories and 505 poems.
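To make the two counting ideas mentioned above concrete, the following minimal sketch computes plain TF-IDF scores and a Zipf-style rank-frequency table for tokenized documents. It does not reproduce BaSa, whose details are not given here. The function names and the transliterated toy tokens are assumptions for illustration, and the TF-IDF variant used (relative term frequency times the natural log of inverse document frequency) is one common choice among several.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {token: score} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            token: (count / len(doc)) * math.log(n_docs / doc_freq[token])
            for token, count in counts.items()
        })
    return scores


def rank_frequency(docs):
    """Return (rank, token, frequency) rows, most frequent token first."""
    counts = Counter(token for doc in docs for token in doc)
    return [
        (rank, token, freq)
        for rank, (token, freq) in enumerate(counts.most_common(), start=1)
    ]


# Hypothetical transliterated toy documents (not from the paper's corpora).
docs = [
    ["nadi", "ka", "paani"],
    ["paani", "ka", "rang"],
    ["phool", "ka", "rang"],
]

scores = tf_idf(docs)
table = rank_frequency(docs)
print(scores[0])
print(table[0])
```

In this toy input the token that occurs in every document receives a TF-IDF score of zero, while it sits at rank 1 in the frequency table, which shows how the two views can disagree about which tokens matter.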
Poetry makes up a vast part of the literature of any language, and Hindi poetry likewise forms a massive portion of Hindi literature. Composing Hindi poetry requires following various verse-writing rules. This paper focuses on automatic metadata generation from such poems using advanced, systematic, prosody rule-based modeling and detection procedures that are integrated with computational linguistics and specially designed for Hindi poetry. The paper covers various challenges and the best possible solutions for them, describing the methodology for generating metadata automatically for "Chhand" based on the poems' stanzas. It also provides some advanced information and techniques for metadata generation for "Muktak Chhands". The rules of the "Chhands" incorporated in this research were identified, verified, and modeled from a computational linguistics perspective for the very first time, which required considerable effort and time. In this research work, 111 different "Chhand" rules were found. This paper presents rule-based modeling of all of the "Chhands". Of all the modeled "Chhands", the research work covers 53 "Chhands" for which between 20 and 277 examples were found and used for automatic processing of the data for metadata generation. The automatic metadata generator processed 3120 UTF-8 based inputs of 53 Hindi "Chhand" types and achieved 95.02% overall accuracy, with an overall failure rate of 4.98%. The minimum time taken to process a "Chhand" for metadata generation was 1.12 seconds, and the maximum was 91.79 seconds.
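The general idea of rule-based "Chhand" detection can be sketched as follows. This is a heavily simplified illustration and not the paper's generator: it assumes each charan has already been converted into a sequence of syllable weights ("L" for laghu, 1 matra; "G" for guru, 2 matras), and it checks only the matra count per charan. The two-entry rule table (13-11-13-11 for Doha, 16 per charan for Chaupai) is an assumed example based on the commonly cited matra counts for these metres; real rules include further constraints, and the paper's 111 rules are not reproduced here.

```python
# Matra value of each syllable weight: laghu = 1, guru = 2.
MATRA = {"L": 1, "G": 2}

# Hypothetical rule table: expected matra count for each charan.
CHHAND_RULES = {
    "Doha": (13, 11, 13, 11),
    "Chaupai": (16, 16, 16, 16),
}


def matra_count(charan):
    """Sum the matras of one charan given as a sequence of 'L'/'G' weights."""
    return sum(MATRA[weight] for weight in charan)


def detect_chhand(charans, rules=CHHAND_RULES):
    """Return the first rule name whose matra pattern matches, else None."""
    pattern = tuple(matra_count(charan) for charan in charans)
    for name, expected in rules.items():
        if pattern == expected:
            return name
    return None


# Synthetic weight sequences chosen only to hit the target counts.
doha_like = [list("GGGGGGL"), list("GGGGGL")] * 2
chaupai_like = [list("GGGGGGGG")] * 4

print(detect_chhand(doha_like))
print(detect_chhand(chaupai_like))
print(detect_chhand([list("GL")]))
```

The hard part of a real system lies before this step, in deriving laghu and guru weights from UTF-8 Devanagari text, which this sketch deliberately leaves out.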
Gujarati is an Indo-Aryan language spoken by the Gujaratis, the people of the state of Gujarat in India. Gujarati is one of the 22 official languages recognized by the Indian government. The Gujarati script was adapted from the Devanagari script. Approximately 3000 idioms are available in the Gujarati language. Machine translation of an idiom is a challenging task because contextual information is important for translating a particular idiom. When Gujarati idioms are translated into English or any other language, the surrounding contextual words are considered if the meaning of the idiom is ambiguous. This paper experiments with the IndoWordNet for Gujarati to obtain synonyms of the surrounding contextual words. It uses an n-gram model and experiments with various window sizes around the particular idiom, as well as the role of stop-words, for correct context identification. The paper demonstrates the usefulness of the context window for identifying the intended meaning of idioms with multiple meanings. The results of this research could be consumed by any destination-independent machine translation system for the Gujarati language.
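The context-window idea described above can be sketched roughly as follows. This is not the paper's implementation: the function names, the stop-word list, and the per-sense cue-word sets are hypothetical, the cue sets merely stand in for the synonym expansion that the paper obtains from IndoWordNet, and an English sentence is used as a stand-in for Gujarati text.

```python
def context_window(tokens, start, length, window, stop_words=frozenset()):
    """Return up to `window` kept tokens on each side of the idiom.

    The idiom occupies tokens[start:start + length]. Stop-words are
    dropped before the window is cut, so they do not use up window slots.
    """
    left = [t for t in tokens[:start] if t not in stop_words]
    right = [t for t in tokens[start + length:] if t not in stop_words]
    return left[max(0, len(left) - window):] + right[:window]


def pick_sense(context, sense_cues):
    """Pick the sense whose cue words overlap the context the most."""
    context_set = set(context)
    return max(sense_cues, key=lambda sense: len(context_set & sense_cues[sense]))


# English stand-in sentence; the idiom "kick the bucket" starts at index 7.
tokens = ["the", "old", "farmer", "was", "ill", "and", "finally",
          "kick", "the", "bucket", "last", "night"]
stop_words = frozenset({"the", "was", "and"})

# Hypothetical cue words per sense.
sense_cues = {
    "die": {"ill", "death", "funeral"},
    "literal": {"water", "foot", "spill"},
}

with_filter = context_window(tokens, 7, 3, window=2, stop_words=stop_words)
without_filter = context_window(tokens, 7, 3, window=2)
sense = pick_sense(with_filter, sense_cues)
print(with_filter, without_filter, sense)
```

With the same window size, filtering stop-words lets an informative word enter the window that would otherwise be crowded out, which is one way to see why the role of stop-words and the window size are studied together.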