2019
DOI: 10.1016/j.specom.2019.01.004

End-to-end acoustic modeling using convolutional neural networks for HMM-based automatic speech recognition

Abstract: In hidden Markov model (HMM) based automatic speech recognition (ASR) systems, modeling the statistical relationship between the acoustic speech signal and the HMM states that represent linguistically motivated subword units such as phonemes is a crucial step. This is typically achieved by first extracting acoustic features from the speech signal based on prior knowledge, such as speech perception and/or speech production knowledge, and then training a classifier such as artificial neural networks (ANN), Gauss…

Cited by 140 publications (67 citation statements). References 41 publications.
“…To better understand the spectral information being modelled by the CNNs, we analysed the cumulative frequency response of the first convolutional layer filters, as done in [39,21]: where $N_f$ is the number of filters and $F_k$ is the frequency response of filter $f_k$. Fig.…”
Section: Analysis of Frequency Response of the First Layer Filters (citation type: mentioning)
confidence: 99%
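The excerpt omits the formula itself. The following minimal NumPy sketch assumes the common formulation in which each filter's magnitude response $F_k$ is normalised by its L2 norm before summing over the $N_f$ filters. The FFT size, the 16 kHz sampling rate and the filter shapes are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def cumulative_frequency_response(filters: np.ndarray, n_fft: int = 512) -> np.ndarray:
    """Sum over filters of L2-normalised magnitude responses.

    filters: shape (N_f, kernel_width), one row per first-layer filter f_k.
    Returns shape (n_fft // 2 + 1,), bins from 0 Hz up to Nyquist.
    """
    # F_k: magnitude of the zero-padded real FFT of each filter
    F = np.abs(np.fft.rfft(filters, n=n_fft, axis=1))
    # Normalise each F_k so filters with large weights do not dominate (assumed choice)
    norms = np.linalg.norm(F, axis=1, keepdims=True)
    F = F / np.maximum(norms, 1e-12)
    # Sum over the N_f filters
    return F.sum(axis=0)

# Hypothetical usage: 80 random filters of 30 taps, 16 kHz audio assumed
rng = np.random.default_rng(0)
filters = rng.standard_normal((80, 30))
resp = cumulative_frequency_response(filters)
freqs = np.fft.rfftfreq(512, d=1 / 16000)   # Hz value of each bin in resp
```

For trained rather than random filters, plotting `resp` against `freqs` gives a single curve showing which frequency bands the first layer emphasises overall.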
“…In conventional ASR systems (Fig. 1, conventional method), the task of recognizing speech is divided into several subtasks, each of which is optimized independently. In [12,9], an end-to-end acoustic modeling approach was proposed in which the features and the classifier are learned jointly. As shown in Fig. 1 (proposed method), the CNN-based end-to-end acoustic modeling approach is composed of a feature learning stage, consisting of several convolution layers, and a classifier stage, consisting of fully connected (FC) layers (also called a multi-layer perceptron (MLP)) and an output layer.…”
Section: Background and Motivation (citation type: mentioning)
confidence: 99%
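The following forward-pass-only NumPy sketch shows the two-stage structure described above: a convolutional feature learning stage applied to raw samples, followed by an MLP classifier and a softmax output layer. Every size in it is a hypothetical choice for illustration and not the configuration reported in [12,9]. This includes the layer count, kernel widths, strides, pooling sizes, the ReLU nonlinearity, the 250 ms window, the 16 kHz sampling rate and the 40 output classes. The weights are random and untrained.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w, b, stride):
    """Valid 1-D cross-correlation. x: (C_in, T), w: (C_out, C_in, K) -> (C_out, T_out)."""
    c_out, c_in, k = w.shape
    t_out = (x.shape[1] - k) // stride + 1
    # Sliding-window indices, shape (t_out, K); x[:, idx] -> (C_in, t_out, K)
    idx = np.arange(k)[None, :] + stride * np.arange(t_out)[:, None]
    windows = x[:, idx].transpose(1, 0, 2)            # (t_out, C_in, K)
    return np.einsum("tck,ock->ot", windows, w) + b[:, None]

def max_pool(x, size):
    """Non-overlapping max pooling along time; drops any leftover frames."""
    t = (x.shape[1] // size) * size
    return x[:, :t].reshape(x.shape[0], -1, size).max(axis=2)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    e = np.exp(z - z.max())                          # shift for numerical stability
    return e / e.sum()

def init(shape):
    """Small random weights and zero bias (untrained, for shape-checking only)."""
    return rng.standard_normal(shape) * 0.1, np.zeros(shape[0])

# Feature learning stage (hypothetical sizes): weights, bias, stride, pool size.
# 30 taps at an assumed 16 kHz is about 1.9 ms of signal per first-layer filter.
CONV_LAYERS = [
    (*init((32, 1, 30)), 10, 3),
    (*init((32, 32, 7)), 1, 3),
]

def learn_features(wave):
    """Raw waveform window (T,) -> learned feature map (C, T')."""
    h = wave[None, :]                                # single input channel
    for w, b, stride, pool in CONV_LAYERS:
        h = relu(max_pool(conv1d(h, w, b, stride), pool))
    return h

WINDOW = 4000                                        # 250 ms at an assumed 16 kHz
N_FEAT = learn_features(np.zeros(WINDOW)).size
N_CLASSES = 40                                       # hypothetical number of output classes

# Classifier stage: one hidden FC layer (MLP) followed by the output layer
W1, b1 = init((256, N_FEAT))
W2, b2 = init((N_CLASSES, 256))

def posteriors(wave):
    """Raw waveform window -> estimated class posteriors P(class | x)."""
    h = learn_features(wave).ravel()
    h = relu(W1 @ h + b1)
    return softmax(W2 @ h + b2)

p = posteriors(rng.standard_normal(WINDOW))
```

Both stages sit in one differentiable function from raw samples to posteriors, so training would optimise the filters and the classifier together rather than fixing the features beforehand. In hybrid HMM/ANN decoding, such posteriors are commonly divided by class priors to obtain scaled likelihoods for the HMM states.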
“…speech signal of about 2 ms, which is less than one pitch period. Upon analysis of the filters using two different methods, namely spectral dictionary based interpretation [12] and guided backpropagation based analysis [13], it was found that the CNN learns to model formant frequency information for phone posterior probability estimation. This is interesting, given that the approach does not assume any specific model for the speech signal.…”
Section: Background and Motivation (citation type: mentioning)
confidence: 99%
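The following small arithmetic sketch puts "about 2 ms" in perspective. It assumes 16 kHz audio, which the excerpt does not state. A 2 ms span covers 32 samples. It is shorter than one glottal cycle whenever F0 is below 1000 / 2 = 500 Hz, which includes typical adult voices of roughly 80 to 250 Hz.

```python
SAMPLE_RATE_HZ = 16_000   # assumed sampling rate; not given in the excerpt
KERNEL_MS = 2.0           # "about 2 ms" of signal seen by a first-layer filter

def samples_for(duration_ms, sr=SAMPLE_RATE_HZ):
    """Number of samples covering duration_ms at sample rate sr."""
    return round(duration_ms * sr / 1000)

def pitch_period_ms(f0_hz):
    """Duration in ms of one glottal cycle at fundamental frequency f0_hz."""
    return 1000.0 / f0_hz

kernel_samples = samples_for(KERNEL_MS)     # 32 samples at 16 kHz
max_f0_hz = 1000.0 / KERNEL_MS              # 500 Hz: at or above this, 2 ms spans a full period
periods = {f0: pitch_period_ms(f0) for f0 in (80, 150, 250)}   # 12.5, ~6.7, 4.0 ms
```

Because each filter sees less than one pitch period, its response reflects short-time spectral shape (such as formant structure) more than the harmonic spacing set by F0.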
“…It was observed that context-dependent untied models outperform the others, with a lower word error rate and better accuracy. An end-to-end acoustic modeling approach using convolutional neural networks (CNNs) for HMM-based ASR was proposed in [29]. In this approach, the appropriate features and the classifier are learned jointly from the raw speech signal.…”
Section: Related Work (citation type: mentioning)
confidence: 99%