INTRODUCTION: Language serves as the primary conduit for human expression, extending into communication mediums such as email and text messaging, where emoticons are frequently employed to convey nuanced emotions. In long-distance digital communication, the detection and analysis of emotions assume paramount importance. The task is inherently challenging, however, because emotions are subjective and lack a universal consensus for quantification or categorization.
OBJECTIVES: This research proposes a novel speech recognition model for emotion analysis, leveraging diverse machine learning techniques together with a three-layer feature extraction approach. It also sheds light on the robustness of the models on balanced and imbalanced datasets.
METHODS: The proposed three-layer feature extractor computes chroma, MFCC, and Mel spectrogram features and passes them to classifiers: K-Nearest Neighbour, Gradient Boosting, Multi-Layer Perceptron, and Random Forest.
RESULTS: Among these classifiers, the Multi-Layer Perceptron (MLP) emerges as the top performer, achieving accuracies of 99.64%, 99.43%, and 99.31% on the Balanced TESS Dataset, the Imbalanced TESS (Half) Dataset, and the Imbalanced TESS (Quarter) Dataset, respectively. K-Nearest Neighbour (KNN) follows as the second-best classifier, surpassing MLP only on the Imbalanced TESS (Half) Dataset with 99.52% accuracy.
CONCLUSION: This research contributes valuable insights into effective emotion recognition from speech, shedding light on the nuances of classification on imbalanced datasets.