Sindhi is a highly homographic language. In real-life applications its text is written without diacritics, which creates lexical and morphological ambiguity. This is one of the most critical problems facing Sindhi computational processing, and it makes it difficult to assign the correct syntactic category to words in the text. Much work has been done on diacritic restoration using statistical and linguistic approaches, but the results are still not at an acceptable level. The tagging of non-diacritized words can be addressed using semantic knowledge. This paper describes a rule-based semantic part-of-speech (POS) tagging system that relies on a WordNet to identify the analogical relations between words in the text. The proposed approach focuses on the use of WordNet structures for the tagging task. POS tagging is the process of assigning the correct syntactic category to each word. A tagset and word disambiguation rules are fundamental parts of any POS tagger. In this research, a tagset for Sindhi POS, word disambiguation rules, and tagging and tokenization algorithms are designed and developed. Two types of lexicon are used: one for simple words and the other for disambiguated words. The corpus is collected from a comprehensive Sindhi dictionary and is based on the most recent available vocabulary used by local people. Experiments using the combination of the two lexicons show promising results, and the accuracy of our proposed approach is acceptable.
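The two-lexicon lookup described in this abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the lexicon entries are placeholder tokens rather than Sindhi words, the tag names are not the paper's tagset, and the single context rule stands in for the WordNet-based disambiguation rules, which the abstract does not spell out.

```python
import re

# Hypothetical lexicon of unambiguous ("simple") words: token -> tag.
# The tokens are placeholders, not real Sindhi vocabulary.
SIMPLE_LEXICON = {"w1": "NOUN", "w2": "VERB"}

# Hypothetical lexicon of homographic words: token -> candidate tags.
AMBIGUOUS_LEXICON = {"w3": ["NOUN", "VERB"]}


def tokenize(text):
    """Split text into word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"\w+", text)


def disambiguate(candidates, prev_tag):
    """Pick one tag for an ambiguous word.

    Toy rule (an assumption, not the paper's rule): directly after a
    noun, prefer the verb reading; otherwise take the first candidate.
    """
    if prev_tag == "NOUN" and "VERB" in candidates:
        return "VERB"
    return candidates[0]


def tag(text):
    """Tag each token using the simple lexicon first, then the ambiguous one."""
    tagged = []
    prev_tag = None
    for token in tokenize(text):
        if token in SIMPLE_LEXICON:
            current = SIMPLE_LEXICON[token]
        elif token in AMBIGUOUS_LEXICON:
            current = disambiguate(AMBIGUOUS_LEXICON[token], prev_tag)
        else:
            current = "UNK"  # token found in neither lexicon
        tagged.append((token, current))
        prev_tag = current
    return tagged
```

With these placeholder entries, `tag("w1 w3")` resolves `w3` as a verb because it follows a noun, while `tag("w3 w2")` falls back to the first listed candidate. A real system would replace `disambiguate` with rules that consult semantic relations between neighbouring words.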
Text-to-speech (TTS) synthesis technology enables a machine to convert text into audible speech and is used throughout the world to enhance the accessibility of information. An important component of any TTS synthesis system is the database of sounds. In this study, three types of sound units (phonemes, diphones and syllables) are concatenated to produce natural sound for a good-quality Sindhi text-to-speech (STTS) system. The objective of this paper is to treat the phonemes, diphones and syllables from the perspective of the lexicon. The methodology used in STTS is to exploit acoustic representations of speech for synthesis, together with linguistic analyses of text. Sindhi is a highly homographic language. In real-life applications its text is written without diacritics, which creates lexical and morphological ambiguity. The problem of understanding non-diacritized words can be addressed using semantic knowledge. This paper describes a Sindhi TTS synthesis system that relies on a WordNet to identify the analogical relations between words in the text. The proposed approach focuses on the use of WordNet structures for the synthesis task. An architecture and a novel algorithm for STTS are proposed. Experiments using WordNet show promising results, and the accuracy of our proposed approach is acceptable.
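One way to picture concatenating units of different sizes is a greedy longest-match lookup over a phoneme sequence, sketched below. This is not the paper's algorithm, which the abstract does not describe. The unit inventory, the labels, and the greedy strategy are all assumptions, and keying a diphone by two adjacent phoneme symbols is a simplification of how diphones are actually cut.

```python
# Hypothetical unit inventory: phoneme-symbol tuple -> label of a stored recording.
# Longer keys stand for larger units (syllables), shorter keys for smaller ones.
UNIT_DB = {
    ("s", "a", "n"): "syllable:san",
    ("d", "h"): "diphone:d-h",
    ("s",): "phoneme:s",
    ("a",): "phoneme:a",
    ("n",): "phoneme:n",
    ("d",): "phoneme:d",
    ("h",): "phoneme:h",
    ("i",): "phoneme:i",
}


def select_units(phonemes, unit_db, max_len=3):
    """Cover the phoneme sequence with the largest stored units available.

    At each position, try the longest span first and fall back to
    shorter spans; raise KeyError if no stored unit matches at all.
    """
    selected = []
    i = 0
    while i < len(phonemes):
        for span in range(min(max_len, len(phonemes) - i), 0, -1):
            key = tuple(phonemes[i:i + span])
            if key in unit_db:
                selected.append(unit_db[key])
                i += span
                break
        else:
            raise KeyError(f"no unit covers phoneme {phonemes[i]!r}")
    return selected


units = select_units(["s", "a", "n", "d", "h", "i"], UNIT_DB)
print(units)
```

In a real synthesizer the selected labels would index audio segments that are then joined and smoothed. Preferring larger units reduces the number of joins, which is one common motivation for mixing syllables, diphones and phonemes.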