2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2012
DOI: 10.1109/icassp.2012.6288864

Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition

Abstract: Convolutional Neural Networks (CNN) have shown success in achieving translation invariance for many image processing tasks. The success is largely attributed to the use of local filtering and max-pooling in the CNN architecture. In this paper, we propose to apply CNN to speech recognition within the framework of hybrid NN-HMM model. We propose to use local filtering and max-pooling in frequency domain to normalize speaker variance to achieve higher multi-speaker speech recognition performance. In our method, a…
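
The abstract is truncated, so the authors' exact architecture is not available here. As a rough illustration only, local filtering followed by max-pooling along the frequency axis can be sketched as below. The function name, filter shape, pooling size, and the toy one-hot "spectral peak" frames are all assumptions made for this example, not details from the paper, and the sketch omits the nonlinearity and the HMM part entirely.

```python
import numpy as np

def local_filter_and_pool(frame, filters, pool_size):
    """Hypothetical helper: filter one spectral frame locally, then max-pool.

    frame:   (n_bands,) filter-bank energies for a single time frame
    filters: (n_filters, width) weights shared across all frequency positions
    """
    n_filters, width = filters.shape
    n_positions = frame.shape[0] - width + 1
    # Local filtering: apply the same small filters at every valid band offset.
    windows = np.stack([frame[i:i + width] for i in range(n_positions)])
    activations = windows @ filters.T  # shape (n_positions, n_filters)
    # Non-overlapping max-pooling along frequency; trailing remainder is dropped.
    n_pooled = n_positions // pool_size
    trimmed = activations[:n_pooled * pool_size]
    return trimmed.reshape(n_pooled, pool_size, n_filters).max(axis=1)

# Toy frames (assumed data): the same spectral peak, shifted by one band,
# standing in for a small speaker-dependent frequency shift.
frame_a = np.zeros(20)
frame_a[10] = 1.0
frame_b = np.zeros(20)
frame_b[11] = 1.0

center_filter = np.array([[0.0, 1.0, 0.0]])  # one filter, width 3 (assumed)
pooled_a = local_filter_and_pool(frame_a, center_filter, pool_size=3)
pooled_b = local_filter_and_pool(frame_b, center_filter, pool_size=3)
```

In this toy case both peaks fall inside the same pooling window, so `pooled_a` and `pooled_b` are identical. That is the sense in which pooling over frequency can absorb small spectral shifts; a shift that crosses a pooling boundary would still change the output.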

Cited by 681 publications (211 citation statements)
References: 7 publications

“…Subsampling can achieve invariance of features with regard to geometric distortion. Due to these advantages, CNN finds applications in computer vision [32][33][34], natural language processing [35,36], and speech recognition [37,38].…”
Section: Convolutional Neural Network (mentioning)
confidence: 99%
“…DCNN has been widely used in the field of image recognition, speech recognition and computer vision fields [46][47][48], where it can achieve better performance than the traditional NN model. Recently, it has also been introduced to remote sensing data analysis, such as remote sensing scene classification [43][44][45] and object detection [49][50][51].…”
Section: Discussion (mentioning)
confidence: 99%
“…Unfortunately, incorporating the delta features into the joint model presented here would be technically challenging, because we would have to propagate the error through the derivatives. However, training the network on several neighbouring spectro-temporal patches instead of just one is possible by modifying the proposed structure so that it has convolutional units (Abdel-Hamid et al., 2012; Vesely et al., 2011). This modification is highlighted in Fig.…”
Section: The Joint Optimization Of Spectro-temporal Features and Neur (mentioning)
confidence: 99%