Speech Corpora of Under Resourced Languages of North-East India

Deka, Barsha; Chakraborty, Joyshree; Dey, Abhishek; Nath, Sanghamitra; Sarmah, Priyankoo; Nirmala, S. R.; Vijaya, Samudra

doi:10.1109/icsda.2018.8693038

Cited by 11 publications

(7 citation statements)

References 1 publication


“…So, we also recorded spoken English from native speakers of Assamese and Bengali. A detailed description of the database of these 3 languages is given in [6]. Salient features of the text and speech corpora are presented in the next subsections.…”

Section: Spoken Language Resources (mentioning)

confidence: 99%

Language Identification of Assamese, Bengali and English Speech

Chakraborty, Nath, Nirmala, et al. 2018

6th Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU 2018)

Self Cite


Machine identification of the language of input speech is of practical interest in regions where people are either bilingual or multilingual. Here, we present the development of an automatic language identification system that identifies the language of input speech as Assamese, Bengali, or English as spoken by these speakers. The speech databases comprise sentences read by multiple speakers using their mobile phones. The Kaldi toolkit was used to train acoustic models based on hidden Markov models in conjunction with Gaussian mixture models and deep neural networks. The accuracy of the implemented language identification system on test data is 99.3%.


“…• Speech corpora for low-resourced languages of North-East India is designed for Assamese, Bengali and Nepali. 1,000 sentences of Assamese language from novels, story books and proverbs were read and recorded by 27 native speakers on telephone channel with the help of interactive voice response system [11].…”

Section: Introduction (mentioning)

confidence: 99%

Annotated Speech Corpus for Low Resource Indian Languages: Awadhi, Bhojpuri, Braj and Magahi

Kumar, Singh, Ratan, et al. 2022

Preprint


In this paper we discuss in-progress work on the development of a speech corpus for four low-resource Indo-Aryan languages (Awadhi, Bhojpuri, Braj and Magahi) using the field methods of linguistic data collection. The total size of the corpus currently stands at approximately 18 hours (approx. 4-5 hours per language), and it is transcribed and annotated with grammatical information such as part-of-speech tags, morphological features and Universal Dependency relationships. We discuss our methodology for data collection in these languages, most of which was done in the middle of the COVID-19 pandemic, with one of the aims being to generate some additional income for low-income groups speaking these languages. In the paper, we also discuss the results of the baseline experiments for automatic speech recognition systems in these languages.


“…With under-resourced languages (such as [5]) and/or tasks (pathological detection with speech signals), we lack large datasets. By under-resourced, we mean limited digital resources (limited acoustic and text corpora) and/or a lack of linguistic expertise.…”

Section: Introduction (mentioning)

confidence: 99%

Deep Neural Networks for Automatic Speech Processing: A Survey from Large Corpora to Limited Data

Roger, Farinas, Pinquier 2020

Preprint


Most state-of-the-art speech systems use Deep Neural Networks (DNNs). These systems require a large amount of data for training. Hence, training state-of-the-art frameworks on under-resourced speech languages/problems is a difficult task. One such problem is the limited amount of data for impaired speech. Furthermore, acquiring more data and/or expertise is time-consuming and expensive. In this paper we position ourselves for the following speech processing tasks: Automatic Speech Recognition, speaker identification and emotion recognition. To assess the problem of limited data, we first investigate state-of-the-art Automatic Speech Recognition systems, as this represents the hardest task (due to the large variability in each language). Next, we provide an overview of techniques and tasks requiring less data. In the last section we investigate few-shot techniques, as we interpret under-resourced speech as a few-shot problem. In that sense we propose an overview of few-shot techniques and perspectives on using such techniques for the speech problems that are the focus of this survey. It turns out that the reviewed techniques are not well adapted for large datasets. Nevertheless, some promising results from the literature encourage the usage of such techniques for speech processing.
