Development of security systems using DNN and i &amp; x-vector classifiers

Mamyrbayev, Оrken; Kydyrbekova, Aizat; Alimhan, Keylan; Оралбекова, Дина; Zhumazhanov, Bagashar; Nuranbayeva, Bulbul M.

doi:10.15587/1729-4061.2021.239186

Cited by 9 publications

(6 citation statements)

References 31 publications

(39 reference statements)

Supporting

Mentioning

Contrasting

Order By: Relevance

“…To train the Transformer, Transformer + CTC models with LM and without LM, it was decided to divide the corpus of 400 h of speech into two parts: 200 h of "pure" speech and 200 h of spontaneous telephone speech. This corpus was assembled in the laboratory "Computer Engineering of Intelligent Systems" IICT MES RK 13 , 26 . When creating the corpus, various types of speech were taken into account: prepared (reading), spontaneous.…”

Section: Experiments and Resultsmentioning

confidence: 99%

A study of transformer-based end-to-end speech recognition system for Kazakh language

Mamyrbayev

Dina

Alimhan

et al. 2022

Sci Rep

View full text Add to dashboard Cite

Today, the Transformer model, which allows parallelization and also has its own internal attention, has been widely used in the field of speech recognition. The great advantage of this architecture is the fast learning speed, and the lack of sequential operation, as with recurrent neural networks. In this work, Transformer models and an end-to-end model based on connectionist temporal classification were considered to build a system for automatic recognition of Kazakh speech. It is known that Kazakh is part of a number of agglutinative languages and has limited data for implementing speech recognition systems. Some studies have shown that the Transformer model improves system performance for low-resource languages. Based on our experiments, it was revealed that the joint use of Transformer and connectionist temporal classification models contributed to improving the performance of the Kazakh speech recognition system and with an integrated language model it showed the best character error rate 3.7% on a clean dataset.

show abstract

Section: Experiments and Resultsmentioning

confidence: 99%

A study of transformer-based end-to-end speech recognition system for Kazakh language

Mamyrbayev

Dina

Alimhan

et al. 2022

Sci Rep

View full text Add to dashboard Cite

show abstract

“…Besides it has been proposed to replace UBM and i-vector classifier by deep neural network (DNN) taking into account the experience of deep learning for speech recognition [9,10]. The DNNbased d-vector framework assigns the ground-truth speaker identity of a training speech signal as the labels of the training frames of this signal.…”

Section: Analysis Of Recent Research and Publicationsmentioning

confidence: 99%

“…Then the mean and standard deviation of the framelevel features of a signal are concatenated as a segment-level feature by a statistical pooling layer. Finally the segment-level features are classified to its speaker by a feedforward network [9,11].…”

Section: Analysis Of Recent Research and Publicationsmentioning

confidence: 99%

The structural tuning of the convolutional neural network for speaker identification in mel frequency cepstrum coefficients space

Matychenko,

Polyakova

2023

HAIT

View full text Add to dashboard Cite

As a result of the literature analysis, the main methods for speaker identification from speech signals were defined. These are statistical methods based on Gaussian mixture model and a universal background model, as well as neural network methods, in particular, using convolutional or Siamese neural networks. The main characteristics of these methods are the recognition performance, a number of parameters, and the training time. High recognition performance is achieved by using convolutional neural networks, but a number of parameters of these networks are much higher than for statistical methods, although lower than for Siamese neural networks. A large number of parameters require a large training set, which is not always available for the researcher. In addition, despite the effectiveness of convolutional neural networks, model size and inference efficiency remain important fordevices with a limited source of computing power, such as peripheral or mobile devices. Therefore, the aspects of tuning of the structure of existing convolutional neural networks are relevant for research. In this work, we have performed a structural tuning of an existing convolutional neural network based on the VGGNet architecture for speaker identification in the space of mel frequency cepstrum coefficients. The aim of the work was to reduce the number of neural network parameters and, as a result, to reduce the network training time, provided that the recognition performance is sufficient (the correct recognition is above 95 %). The neural network proposed as a result of structural tuning has fewer layers than the architecture of the basic neural network. Insteadof the ReLU activation function, the related Leaky ReLU function with a parameter of 0.1 was used. The number of filters and the size of kernels in convolutional layers are changed. The size of kernels for the max pooling layer has been increased. It is proposedto use the averaging of the results of each convolution to input a two-dimensional convolution results to a fully connected layer with the Softmax activation function. The performed experiment showed that the number of parameters of the proposed neural network is 29 % less than the number of parameters of the basic neural network, provided that the speaker recognition performance is almost the same. In addition, the training time of the proposed and basic neural networks was evaluated on five datasets of audio recordings corresponding to different numbers of speakers. The training time of the proposed network was reduced by 10-39 % compared to the basic neural network. The results of the research show the advisability of the structural tuning of the convolutional neural network for devices with a limited source of computing, namely, peripheral or mobile devices.

show abstract

“…Thus, our extracted features were already high-level, and there was no need to map these original data to phonemes. In this work, we implemented our model using shallow bidirectional LSTMs [29].…”

Section: Joint Application Of Connectionist Temporal Classification A...mentioning

confidence: 99%

Identifying the influence of transfer learning method in developing an end-to-end automatic speech recognition system with a low data level

Mamyrbayev

Alimhan

Оралбекова

et al. 2022

EEJET

Self Cite

View full text Add to dashboard Cite

Ensuring the best quality and performance of modern speech technologies, today, is possible based on the widespread use of machine learning methods. The idea of this project is to study and implement an end-to-end system of automatic speech recognition using machine learning methods, as well as to develop new mathematical models and algorithms for solving the problem of automatic speech recognition for agglutinative (Turkic) languages. Many research papers have shown that deep learning methods make it easier to train automatic speech recognition systems that use an end-to-end approach. This method can also train an automatic speech recognition system directly, that is, without manual work with raw signals. Despite the good recognition quality, this model has some drawbacks. These disadvantages are based on the need for a large amount of data for training. This is a serious problem for low-data languages, especially Turkic languages such as Kazakh and Azerbaijani. To solve this problem, various methods are needed to apply. Some methods are used for end-to-end speech recognition of languages belonging to the group of languages of the same family (agglutinative languages). Method for low-resource languages is transfer learning, and for large resources – multi-task learning. To increase efficiency and quickly solve the problem associated with a limited resource, transfer learning was used for the end-to-end model. The transfer learning method helped to fit a model trained on the Kazakh dataset to the Azerbaijani dataset. Thereby, two language corpora were trained simultaneously. Conducted experiments with two corpora show that transfer learning can reduce the symbol error rate, phoneme error rate (PER), by 14.23 % compared to baseline models (DNN+HMM, WaveNet, and CNC+LM). Therefore, the realized model with the transfer method can be used to recognize other low-resource languages.

show abstract

Development of security systems using DNN and i & x-vector classifiers

Cited by 9 publications

References 31 publications

A study of transformer-based end-to-end speech recognition system for Kazakh language

A study of transformer-based end-to-end speech recognition system for Kazakh language

The structural tuning of the convolutional neural network for speaker identification in mel frequency cepstrum coefficients space

Identifying the influence of transfer learning method in developing an end-to-end automatic speech recognition system with a low data level

Contact Info

Product

Resources

About