Convolutional neural networks (CNNs) lead the field of sound recognition because of their flexibility and their many tunable parameters. Recognizing spoken English alphabet letters from different speakers with deep learning techniques has attracted the attention of the research community. In this paper, we explore the use of a CNN, a deep learner that learns features automatically from the dataset during training, for classifying sound signals of English alphabet letters. We consider two CNN architectures. In the first, we use MFCC-based features with a pretrained CNN that has two convolutional layers. In the second, we propose a hybrid feature extraction method to train a block-based CNN. Each proposed system consists of two components: hybrid feature extraction and a CNN classifier. Five auditory features are extracted: the log-Mel spectrogram (LM), Mel-frequency cepstral coefficients (MFCC), chroma, spectral contrast, and Tonnetz, where the last three are referred to collectively as CST. LM and MFCC are combined into one feature set, and LM, MFCC, and CST are aggregated into another; these sets are used to train the two proposed CNNs, respectively. Sound samples of the English alphabet were collected from speakers of different age groups. The feature sets produced by the hybrid feature extraction methods are presented to both proposed CNNs, and the experimental results are recorded. The results indicate that the classification accuracy of the proposed architectures can surpass that of existing CNN methods that rely on a single feature extraction method. The second proposed architecture performs better than the first.
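
The feature aggregation described above can be illustrated with a minimal sketch. It assumes each feature is a matrix with one row per coefficient and one column per time frame, and that aggregation means stacking the matrices along the coefficient axis; the abstract states neither. The function names, the dictionary keys, and the dimensions (60 LM bands, 20 MFCCs, 12 chroma bins, 7 spectral contrast bands, 6 Tonnetz dimensions, 41 frames) are hypothetical placeholders, and zero-filled matrices stand in for real extracted features.

```python
from typing import Dict, List, Sequence

Matrix = List[List[float]]  # rows = feature coefficients, columns = time frames


def stack_features(blocks: Sequence[Matrix]) -> Matrix:
    """Concatenate feature matrices along the coefficient (row) axis.

    Every block must cover the same number of time frames.
    """
    if not blocks:
        raise ValueError("need at least one feature block")
    n_frames = len(blocks[0][0])
    stacked: Matrix = []
    for block in blocks:
        for row in block:
            if len(row) != n_frames:
                raise ValueError("all feature blocks must share the frame count")
            stacked.append(list(row))
    return stacked


def build_feature_sets(features: Dict[str, Matrix]) -> Dict[str, Matrix]:
    """Build the two aggregated sets: LM+MFCC and LM+MFCC+CST."""
    # CST groups chroma, spectral contrast, and Tonnetz.
    cst = stack_features(
        [features["chroma"], features["contrast"], features["tonnetz"]]
    )
    return {
        "LM+MFCC": stack_features([features["lm"], features["mfcc"]]),
        "LM+MFCC+CST": stack_features([features["lm"], features["mfcc"], cst]),
    }


def placeholder(n_rows: int, n_frames: int) -> Matrix:
    """Zero-filled stand-in for a real extracted feature matrix."""
    return [[0.0] * n_frames for _ in range(n_rows)]


# Assumed dimensions, chosen only for illustration.
N_FRAMES = 41
example = {
    "lm": placeholder(60, N_FRAMES),
    "mfcc": placeholder(20, N_FRAMES),
    "chroma": placeholder(12, N_FRAMES),
    "contrast": placeholder(7, N_FRAMES),
    "tonnetz": placeholder(6, N_FRAMES),
}

feature_sets = build_feature_sets(example)
shapes = {name: (len(m), len(m[0])) for name, m in feature_sets.items()}
```

Under these assumed dimensions, the LM+MFCC set has 80 rows and the LM+MFCC+CST set has 105 rows, each spanning the same 41 frames. In practice the placeholder matrices would be replaced by features computed from the recorded audio with an audio analysis library.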