Poetry constitutes a vast part of the literature of any language, and Hindi poetry likewise forms a major portion of Hindi literature. Composing Hindi poetry requires adherence to various verse-writing rules. This paper focuses on automatic metadata generation from such poems through advanced, systematic, prosody-rule-based modeling and detection procedures, grounded in computational linguistics and designed specifically for Hindi poetry. The paper covers the main challenges and the best available solutions to them, describing a methodology for generating automatic "Chhand" metadata from a poem's stanzas, and it also provides further information and techniques for metadata generation for "Muktak Chhands". The "Chhand" rules incorporated in this research were identified, verified, and modeled from a computational-linguistics perspective for the first time, which required considerable effort and time. In this research work, 111 distinct "Chhand" rules were found, and this paper presents rule-based models of all of them. Of the modeled "Chhands", the work covers 53 for which between 20 and 277 examples each were found and used for automatic processing of the data for metadata generation. The automatic metadata generator processed 3120 UTF-8 inputs across the 53 Hindi "Chhand" types, achieving 95.02% overall accuracy with an overall failure rate of 4.98%. The processing time per "Chhand" for metadata generation ranged from a minimum of 1.12 seconds to a maximum of 91.79 seconds.
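To illustrate the kind of prosody rule such a detector has to model, the sketch below counts matras (syllabic weights: laghu = 1, guru = 2) in a Devanagari line. The weight table and the helper name are illustrative assumptions for this sketch; the paper's actual 111 "Chhand" rules are far richer and are not reproduced here.

```python
# Minimal matra (syllable-weight) counter for a Devanagari line -- an
# illustrative simplification, not the paper's actual Chhand rules.

HALANT = "\u094D"                                 # virama: suppresses the inherent vowel
SHORT_SIGNS = set("\u093F\u0941\u0943")           # i, u, ri signs -> laghu (1)
LONG_SIGNS = set("\u093E\u0940\u0942\u0947\u0948\u094B\u094C")  # aa, ii, uu, e, ai, o, au -> guru (2)
SHORT_VOWELS = set("\u0905\u0907\u0909\u090B")    # independent short vowels
LONG_VOWELS = set("\u0906\u0908\u090A\u090F\u0910\u0913\u0914")  # independent long vowels

def count_matras(line: str) -> int:
    """Return the total matra count of a Devanagari line (simplified rules)."""
    total = 0
    chars = [c for c in line if not c.isspace()]
    for i, c in enumerate(chars):
        if "\u0915" <= c <= "\u0939":             # consonant range ka..ha
            nxt = chars[i + 1] if i + 1 < len(chars) else ""
            if nxt == HALANT or nxt in SHORT_SIGNS or nxt in LONG_SIGNS:
                continue                          # weight decided by the following sign
            total += 1                            # bare consonant: inherent short 'a'
        elif c in SHORT_SIGNS or c in SHORT_VOWELS:
            total += 1
        elif c in LONG_SIGNS or c in LONG_VOWELS:
            total += 2
    return total
```

For example, "राम" (rā-ma) weighs 2 + 1 = 3 matras and "कमल" (ka-ma-la) weighs 1 + 1 + 1 = 3; a rule-based detector would compare such per-line totals and patterns against each "Chhand" definition.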
Optical Character Recognition (OCR) is a widely known technique for recognizing printed text by computer with the help of various peripheral devices. OCR research is in progress for the scripts of many languages, while many others remain largely unaddressed. Gujarati is one of the least studied scripts in OCR research compared with other scripts. Tesseract, a well-known open-source OCR engine already used to recognize numerous scripts, can be applied to recognize printed Gujarati characters in digital images. This paper examines the use of Tesseract to recognize Gujarati characters with the already available trained data for the Gujarati script.
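A minimal sketch of the workflow the abstract describes, assuming the common pytesseract wrapper: the actual OCR call is shown commented out because it requires the Tesseract binary with the Gujarati ('guj') trained data installed, and the helper below is a hypothetical post-processing check, not part of the paper's method.

```python
# Sanity check on OCR output: fraction of non-space characters that fall
# in the Gujarati Unicode block (U+0A80..U+0AFF). A low ratio suggests
# the wrong language model was used.

def gujarati_ratio(text: str) -> float:
    """Return the fraction of non-space characters that are Gujarati."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    guj = sum(1 for c in chars if "\u0A80" <= c <= "\u0AFF")
    return guj / len(chars)

# Typical usage (assumes: pip install pytesseract pillow, plus the
# Gujarati traineddata file for the installed Tesseract engine):
#   import pytesseract
#   from PIL import Image
#   text = pytesseract.image_to_string(Image.open("page.png"), lang="guj")
#   print(gujarati_ratio(text))
```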
The authors of this research paper present a mechanism for addressing the inclusion of loanwords, missing words, and newly coined terms in WordNets. WordNet has evolved into one of the most prominent Natural Language Processing (NLP) toolkits, and this mechanism can be used to improve the WordNet of any language. The authors chose to work with Hindi and Gujarati because both languages have major dialects, which strengthens the evaluation. The research used a Hindi corpus of more than 5000 verse-based entries instead of a prose-based corpus. As a result, nearly 14000 Hindi words were discovered that were not present in the popular Hindi IndoWordNet, accounting for 13.23 percent of its existing word count of 105000+. For Gujarati, a distinct method based on idioms was used: around 3500 idioms were processed, and nearly 900 Gujarati terms were discovered that did not exist in the IndoWordNet, accounting for nearly 1.4 percent of its 64000+ Gujarati words. The work will thus contribute almost 14000 Hindi words and around 900 Gujarati words to the IndoWordNet project.
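The coverage numbers above reduce to a set-difference check between corpus vocabulary and WordNet lemmas. The sketch below shows that core computation with toy data; the function name, tokenization, and inputs are illustrative assumptions, and the real work additionally handles loanwords and dialect variants.

```python
# Coverage-gap check: which corpus words are absent from a WordNet lemma
# list, and what percentage of the WordNet's size that gap represents
# (mirroring e.g. ~14000 missing words / 105000+ lemmas = 13.23%).

def missing_words(corpus_tokens, wordnet_lemmas):
    """Return (sorted absent words, gap percentage vs. WordNet size)."""
    lemmas = set(wordnet_lemmas)
    absent = sorted(set(corpus_tokens) - lemmas)
    gap_pct = 100.0 * len(absent) / max(len(lemmas), 1)
    return absent, round(gap_pct, 2)

# Toy example: one loanword in the corpus is missing from the lemma list.
absent, pct = missing_words(
    ["राम", "कमल", "सेल्फी"],       # corpus tokens (सेल्फी is a loanword)
    ["राम", "कमल", "फूल", "नदी"],   # WordNet lemmas
)
# absent == ["सेल्फी"], pct == 25.0
```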