Interspeech 2018
DOI: 10.21437/interspeech.2018-1165
Using Deep Neural Networks for Identification of Slavic Languages from Acoustic Signal

Abstract: This paper investigates the use of deep neural networks (DNNs) for the task of spoken language identification. Various feed-forward fully connected, convolutional and recurrent DNN architectures are adopted and compared against a baseline i-vector based system. Moreover, DNNs are also utilized for extraction of bottleneck features from the input signal. The dataset used for experimental evaluation contains utterances belonging to languages that are all related to each other and sometimes hard to distinguish ev…

Cited by 12 publications (19 citation statements). References 23 publications.
“…Applying ASR methods to SLI, e.g. by training language classifiers on phoneme embeddings extracted from a phoneme recognizer, has shown to work very well [19,20,21,22]. While end-to-end SLI performed directly on labeled speech features is usually outperformed by models that utilize phoneme level information, it is sometimes possible to reach good performance also with end-to-end models [6,23].…”
Section: End-to-end Deep Learning SLI Toolkit
confidence: 99%
“…The test set contains test utterances of varying length, with the median duration at 15 seconds. The Dataset of Slavic Languages (DoSL) contains speech in 11 Slavic languages [19]. The dataset includes 220 hours of training data and 8 hours of test data, where test utterances are almost uniformly distributed between 5 and 6 seconds.…”
Section: Datasets
confidence: 99%
“…In addition to the 39-dimensional MFCCs, we have also utilized 13-dimensional MFCCs with ∆ and ∆∆ coefficients (i.e., a 39-dimensional feature vector as well), and 39-dimensional bottleneck features (BTNs) extracted from the DNN trained for speech recognition (as suggested for the speaker and language identification, e.g., in [39]). Detailed information about our BTN feature extractor can be found in [40].…”
Section: Acoustic Features
confidence: 99%
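The ∆ and ∆∆ coefficients mentioned in this excerpt are conventionally computed with the standard regression formula over a small window of neighboring frames, d_t = Σ_{n=1..N} n·(c_{t+n} − c_{t−n}) / (2·Σ_{n=1..N} n²). The sketch below is an illustration of that general formula, not the cited authors' exact extractor; the window width N=2 and the helper names are assumptions.

```python
# Regression (delta) coefficients over a sequence of cepstral frames,
# d_t = sum_{n=1..N} n*(c[t+n] - c[t-n]) / (2 * sum_{n=1..N} n^2).
# N=2 is a common default, not necessarily what the cited paper used.
def deltas(frames, N=2):
    T, dim = len(frames), len(frames[0])
    denom = 2 * sum(n * n for n in range(1, N + 1))
    out = []
    for t in range(T):
        d = [0.0] * dim
        for n in range(1, N + 1):
            prev = frames[max(t - n, 0)]      # edge frames are replicated
            nxt = frames[min(t + n, T - 1)]
            for k in range(dim):
                d[k] += n * (nxt[k] - prev[k]) / denom
        out.append(d)
    return out

# 13-dim MFCCs -> append deltas and delta-deltas -> 39-dim feature vectors,
# matching the 13 + 13 + 13 layout described in the excerpt.
def add_deltas(mfcc):
    d = deltas(mfcc)
    dd = deltas(d)
    return [c + dc + ddc for c, dc, ddc in zip(mfcc, d, dd)]
```

For frames whose dim-0 value increases by 1 per frame, interior delta values come out as 1.0, i.e. the local slope, which is the intended behavior of the regression formula.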
“…Although DNF-based systems provide the best LID performance, the ASR must be trained first, which requires a large volume of phoneme labeling data. Recently, end-to-end approaches with recurrent neural networks (RNN), convolutional neural networks, and attention-based neural networks have been investigated on LID tasks [8,9].…”
Section: Introduction
confidence: 99%
“…In this paper, we focus on the identification of variable length utterance spoken language from webcast using LSTM-based and self-attention [11] CNN end-to-end approaches [9,10]. We explore the LID models based on global average pooling.…”
Section: Introduction
confidence: 99%
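The global average pooling mentioned in the last excerpt maps a variable-length sequence of frame-level embeddings to one fixed-size utterance vector, which can then feed a language classifier regardless of utterance duration. A minimal sketch of the pooling step alone (the embedding dimension and values are illustrative, not taken from the cited work):

```python
def global_average_pool(frames):
    """Average a variable-length sequence of frame embeddings (T x D)
    into a single fixed-size utterance embedding (length D)."""
    T, D = len(frames), len(frames[0])
    return [sum(f[k] for f in frames) / T for k in range(D)]
```

Because the average is taken over time, a 2-frame clip and a 50-frame clip both produce a D-dimensional vector, which is what makes the approach suitable for the variable-length webcast utterances described above.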