Spoken Language Identification Using ConvNets

Sarthak; Shukla, Shikhar; Mittal, Govind

doi:10.1007/978-3-030-34255-5_17

Cited by 24 publications

(16 citation statements)

References 23 publications


“…The attributes of the proposed method are represented in Table 6. The trial and error method is used while running the convolution neural network [8, 14], word embedding Keras [34, 35], and Naïve Bayes [36–38]. The selection of hyperparameter is also defined as an NP-complete problem [39, 40].…”

Section: Results (mentioning)

confidence: 99%

“…Various state-of-the-art results on various audio classification tasks have been obtained by using log-Mel spectrograms of raw audio, like features, which convert the audio utterance into images [8]. CNN gives an excellent performance gain in classification on these features [14]. The motivation of work has come from these studies.…”

Section: Proposed Spoken Language Identification Framework (mentioning)

confidence: 99%

“…The process of spoken language identification using the CNN technique uses spectrograms of raw audio signals as input to a convolutional neural network (CNN) [8, 14]. A spoken language identification dataset is collected and preprocessed for the training phase.…”

Section: Introduction (mentioning)

confidence: 99%


Spoken Language Identification Using Deep Learning

Singh

Sharma

Kumar

et al. 2021

Computational Intelligence and Neuroscience


The process of detecting language from an audio clip by an unknown speaker, regardless of gender, manner of speaking, and age, is defined as spoken language identification (SLID). The main challenge is to recognize the features that can distinguish between languages clearly and efficiently. The model takes audio files and converts them into spectrogram images. It applies a convolutional neural network (CNN) to extract the main attributes or features so that the output can be detected easily. The main objective is to detect languages out of English, French, Spanish, and German, Estonian, Tamil, Mandarin, Turkish, Chinese, Arabic, Hindi, Indonesian, Portuguese, Japanese, Latin, Dutch, Portuguese, Pushto, Romanian, Korean, Russian, Swedish, Tamil, Thai, and Urdu. An experiment was conducted on different audio files using the Kaggle dataset named spoken language identification. These audio files comprise utterances, each spanning a fixed duration of 10 seconds. The whole dataset is split into training and test sets. Preliminary results give an overall accuracy of 98%. Extensive and accurate testing shows an overall accuracy of 88%.
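The abstract above describes converting audio into spectrogram images before feeding them to a CNN. As a rough illustration of that first step only, the sketch below computes a log-magnitude spectrogram with a naive short-time DFT using the Python standard library. The function name `log_spectrogram`, the frame length, the hop size, and the synthetic test tone are all assumptions made for this example; the cited paper's actual preprocessing (for instance mel scaling or image sizes) is not specified here and is not reproduced.

```python
import cmath
import math


def log_spectrogram(signal, frame_len=64, hop=32, eps=1e-10):
    """Return one row per frame; each row holds log-magnitudes for bins 0..frame_len//2.

    Hypothetical helper for illustration: a naive O(N^2) DFT per frame,
    not an optimized FFT and not a mel-scaled spectrogram.
    """
    # Hann window to reduce spectral leakage at frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        row = []
        for k in range(n_bins):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            row.append(math.log(abs(acc) + eps))
        frames.append(row)
    return frames


# Synthetic input (an assumption): a 1000 Hz tone sampled at 8000 Hz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(512)]
spec = log_spectrogram(tone)

# Each row is one time slice of the "image"; the strongest bin marks the tone.
peak_bins = [max(range(len(row)), key=row.__getitem__) for row in spec]
print(len(spec), len(spec[0]), peak_bins[0])
```

In practice an FFT-based library routine would replace the inner loops; the point here is only the shape of the data a CNN would consume, a 2D array indexed by time frame and frequency bin.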


“…In a more traditional linguistic setting, Sarthak et al [34] explore 1D-ConvNet that auto-extracts and classifies features from raw audio input and 2D-ConvNet architectures, and enhance the performance of these approaches by utilizing Mixup augmentation of inputs and attention mechanism. They achieve 93.7% and 95.4% overall accuracy on a six language (En, Fr, De, Es, Ru, It) dataset with overlapping phonemes based on the VoxForge [38] dataset.…”

Section: Previous Work (mentioning)

confidence: 99%
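The citation statement above mentions Mixup augmentation of inputs. As a minimal sketch of the general Mixup idea (not the cited authors' code), two training examples and their one-hot labels are blended with a weight drawn from a Beta distribution. The function name `mixup`, the value `alpha=0.2`, and the toy inputs are assumptions chosen for illustration.

```python
import random


def mixup(x_a, y_a, x_b, y_b, alpha=0.2, rng=random):
    """Blend two examples and their one-hot labels with weight lam ~ Beta(alpha, alpha).

    Hypothetical helper: inputs are plain lists of floats of equal length.
    """
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam


# Toy data (an assumption): two 3-sample "waveforms" from two different classes.
rng = random.Random(0)
x_mix, y_mix, lam = mixup([1.0, 0.0, -1.0], [1, 0],
                          [0.0, 2.0, 1.0], [0, 1], rng=rng)
print(lam, x_mix, y_mix)
```

The mixed label stays a valid probability vector because both label vectors sum to one and the weights `lam` and `1 - lam` sum to one.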

Low-Resource Spoken Language Identification Using Self-Attentive Pooling and Deep 1D Time-Channel Separable Convolutions

Bedyakin

Htsts

Mikhaylovskiy

2021

Computational Linguistics and Intellectual Technologies


This memo describes NTR/TSU winning submission for Low Resource ASR challenge at Dialog2021 conference, language identification track. Spoken Language Identification (LID) is an important step in a multilingual Automated Speech Recognition (ASR) system pipeline. Traditionally, the ASR task requires large volumes of labeled data that are unattainable for most of the world's languages, including most of the languages of Russia. In this memo, we show that a convolutional neural network with a Self-Attentive Pooling layer shows promising results in low-resource setting for the language identification task and set up a SOTA for the Low Resource ASR challenge dataset. Additionally, we compare the structure of confusion matrices for this and significantly more diverse VoxForge dataset and state and substantiate the hypothesis that whenever the dataset is diverse enough so that the other classification factors, like gender, age etc. are well-averaged, the confusion matrix for LID system bears the language similarity measure.


“…Sarthak, et al [21] used raw audio waveforms as sound input which provide increased performance by avoiding overheads in calculating log-Mel spectrum for each audio file. This study uses the convolutional neural networks classification method because based on the journals studied, convolutional neural networks have very good performance compared to other machine learning techniques, the sound input data used is raw audio waveform because it is quite popular, because raw audio waveforms have advantages.…”

Section: Introduction (mentioning)

confidence: 99%

Spoken language identification using i-vectors, x-vectors, PLDA and logistic regression

Abdurrahman

Zahra

2021

Bulletin EEI


In this paper, i-vectors and x-vectors are used to extract features from speech signals in local Indonesian languages, namely Javanese, Sundanese, and Minang, to help a classifier identify the language spoken by the speaker. Probabilistic linear discriminant analysis (PLDA) is used as the baseline classifier, and logistic regression is also used because prior studies show that it performs better than PLDA for classifying speech data. Once the features are extracted, they are classified using the classifiers mentioned above. In the experiment, we segmented the test data into three segment lengths: 3, 10, and 30 seconds. This study is expanded by testing multiple parameters for the i-vector and x-vector methods and then comparing the performance of PLDA and logistic regression as classifiers. The x-vector scores better than the i-vector for every segment length when PLDA is the classifier; with logistic regression, however, the i-vector still has better accuracy than the x-vector.

