The paper investigates deep neural network architectures for recognizing human emotions from speech. Convolutional neural networks and recurrent neural networks with LSTM memory cells were used as the deep learning models, and an ensemble of neural networks was built on their basis. Computer experiments were conducted in which the proposed deep learning models and baseline machine learning algorithms recognized the emotions in the speech recordings of the RAVDESS audio database. The results showed high efficiency of the neural network models, with accuracy for individual emotions reaching 92%.
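The abstract does not specify how the ensemble combines the CNN and LSTM outputs. A common approach for such ensembles is soft voting, a weighted average of the per-class probabilities from each model. The sketch below is an illustrative assumption, not the paper's method; the function name, the example probability values, and the equal weights are hypothetical, while the eight emotion classes come from the RAVDESS dataset itself.

```python
import numpy as np

# The eight emotion classes defined in the RAVDESS database
EMOTIONS = ["neutral", "calm", "happy", "sad",
            "angry", "fearful", "disgust", "surprised"]

def ensemble_predict(prob_cnn, prob_lstm, weights=(0.5, 0.5)):
    """Soft-voting ensemble: weighted average of the per-class softmax
    probabilities from the CNN and LSTM models, then argmax per utterance."""
    probs = weights[0] * np.asarray(prob_cnn) + weights[1] * np.asarray(prob_lstm)
    return [EMOTIONS[i] for i in probs.argmax(axis=1)]

# Hypothetical softmax outputs for two utterances (one row per utterance)
p_cnn = np.array([[0.05, 0.05, 0.60, 0.05, 0.10, 0.05, 0.05, 0.05],
                  [0.10, 0.10, 0.10, 0.40, 0.10, 0.10, 0.10, 0.00]])
p_lstm = np.array([[0.10, 0.10, 0.50, 0.05, 0.10, 0.05, 0.05, 0.05],
                   [0.05, 0.05, 0.05, 0.50, 0.10, 0.10, 0.10, 0.05]])

print(ensemble_predict(p_cnn, p_lstm))  # → ['happy', 'sad']
```

Soft voting tends to outperform either base model when the CNN and LSTM make uncorrelated errors, which is the usual motivation for ensembling architecturally different networks.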