Recurrent Neural Networks (RNNs) are well suited to emotional speech recognition because they model signals that shift continuously over time. Although RNNs give good results, and gated variants such as GRU, LSTM, and BiLSTM solve the vanishing-gradient problem, overfitting still reduces their efficiency. Hence, this paper designs five deep learning architectures that address these issues using data augmentation and spatial features. The five architectures, Enhanced Deep Hierarchical LSTM & GRU (EDHLG), EDHBG, EDHGL, EDHGB, and EDHGG, are built with dropout layers. The representations learned by the LSTM layer are fed as input to the GRU layer for deeper learning, which reduces the gradient problem and increases the accuracy for each emotion. To raise accuracy further, spatial features are concatenated with MFCC features. Experimental evaluation on a Tamil emotional speech dataset yields strong results for all models: EDHLG reaches 93.12% accuracy, EDHGL 92.56%, EDHBG 95.42%, EDHGB 96%, and EDHGG 94%, whereas a single LSTM layer averages 74% and a BiLSTM 77%. EDHGB outperforms nearly all other systems, with a mean accuracy of 94.27% and a maximum overall accuracy of 95.99%. On the Tamil data, the emotions happy, fearful, angry, sad, and neutral are predicted with 100% accuracy, while disgust reaches 94% and boredom 82%. EDHGB also needs only 4.43 min of training time and 0.42 min of evaluation time, less than the other models. Hence, by permuting the LSTM, BiLSTM, and GRU layers, an extensive experimental analysis on the Tamil dataset shows EDHGB to be superior to the other models, gaining around 26% in accuracy over the basic LSTM and BiLSTM models.
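
The stacked design can be illustrated with a short sketch. Below is a minimal, hypothetical Keras implementation of the EDHLG variant (an LSTM stage feeding a GRU stage, with dropout between stages); the layer sizes, dropout rates, and feature dimensions are illustrative assumptions, not the paper's reported hyperparameters.

# Minimal sketch (assumed hyperparameters) of the EDHLG stack in Keras:
# an LSTM stage whose learned sequence is fed into a GRU stage,
# with dropout layers between stages, over MFCC + spatial input features.
from tensorflow.keras import Sequential, Input
from tensorflow.keras.layers import LSTM, GRU, Dropout, Dense

N_FRAMES = 100        # assumed number of frames per utterance
N_FEATURES = 40 + 24  # assumed: 40 MFCCs concatenated with 24 spatial features per frame
N_EMOTIONS = 7        # happy, fearful, angry, sad, neutral, disgust, boredom

model = Sequential([
    Input(shape=(N_FRAMES, N_FEATURES)),
    LSTM(128, return_sequences=True),   # first recurrent stage; emits the full sequence
    Dropout(0.3),                       # dropout layer to curb overfitting
    GRU(64),                            # second stage: GRU learns from the LSTM's output
    Dropout(0.3),
    Dense(N_EMOTIONS, activation="softmax"),  # one probability per emotion class
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

The other four variants would follow in the same way by permuting the two recurrent stages among LSTM, BiLSTM (a Bidirectional wrapper around LSTM), and GRU, e.g. a GRU stage feeding a BiLSTM stage.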