Speech Emotion Recognition (SER) identifies and categorizes emotional states by analyzing speech signals. The intensity of a specific emotional expression (e.g., anger) conveys critical cues and plays a crucial role in social behavior. SER is intrinsically language-specific; this study investigates a novel cascaded deep learning (DL) model for Bangla SER with intensity levels. The proposed method employs Mel-Frequency Cepstral Coefficients (MFCC), the Short-Time Fourier Transform (STFT), and Chroma STFT signal transformation techniques; the respective transformed features are blended into a 3D form and used as the input to the DL model. The cascaded model performs the task in two stages: it classifies the emotion in Stage 1 and then measures its intensity in Stage 2. The same DL architecture is used in both stages, consisting of a 3D Convolutional Neural Network (CNN), a Time Distribution Flatten (TDF) layer, a Long Short-Term Memory (LSTM) network, and a Bidirectional LSTM (Bi-LSTM). The CNN first extracts features from the 3D input; these features are passed through the TDF layer, Bi-LSTM, and LSTM; finally, the model classifies the emotion along with its intensity level. The proposed model has been evaluated rigorously on the newly developed KBES dataset as well as other benchmark datasets. Compared with existing prominent methods, the proposed model proved to be the best-suited SER approach, achieving accuracies of 88.30% and 71.67% on the RAVDESS and KBES datasets, respectively.
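The 3D feature blending described above can be sketched as follows. This is a minimal, hypothetical illustration only: the frame count, bin count, and the assumption that all three feature maps are resampled to a common size are not taken from the paper, and random arrays stand in for real MFCC, STFT-magnitude, and Chroma STFT features.

```python
import numpy as np

# Illustrative dimensions (assumed, not the paper's exact configuration)
n_bins = 40     # per-feature frequency/coefficient bins
n_frames = 128  # time frames after transformation

rng = np.random.default_rng(0)
mfcc = rng.standard_normal((n_bins, n_frames))       # stand-in for MFCC matrix
stft_mag = rng.standard_normal((n_bins, n_frames))   # stand-in for |STFT|, resampled to n_bins
chroma = rng.standard_normal((n_bins, n_frames))     # stand-in for Chroma STFT, resampled

# Blend the three 2D feature maps into one 3D tensor of shape
# (bins, frames, channels), which a 3D CNN front end can consume.
blended = np.stack([mfcc, stft_mag, chroma], axis=-1)
print(blended.shape)  # (40, 128, 3)
```

In practice, the three feature maps would be computed from the speech waveform (e.g., with a signal-processing library) and normalized before stacking; the channel axis lets the CNN learn across the three representations jointly.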