The capacity to comprehend and communicate with others via language is one of the most valuable human abilities. We are well-trained in our experience reading awareness of different emotions since they play a vital part in communication. Contrary to popular belief, emotion recognition is a challenging task for computers or robots due to the subjective nature of human mood. This research proposes a framework for acknowledging the passionate sections of conversation, independent of the semantic content, via the recognition of discourse feelings. To categorize the emotional content of audio files, this article employs deep learning techniques such as convolutional neural networks (CNNs) and long short-term memories (LSTMs). In order to make sound information as helpful as possible for future use, models using Mel-frequency cepstral coefficients (MFCCs) were created. It was tested using RAVDESS and TESS datasets and found that the CNN had a 97.1% accuracy rate.