Speech emotion recognition (SER) methods rely on frames to analyze speech data. However, existing methods typically divide a speech sample into smaller frames and assign each a single emotional tag, which ignores the possibility that multiple emotion tags coexist within one speech sample. To address this limitation, we present a novel self-labeling learning ensemble via a deep recurrent neural network and self-representation (En-DRNN-SR) for SER. The method automatically segments a speech sample into frames; a deep recurrent neural network (DRNN) then learns deep features from these frames; next, a self-representation model is built to obtain a relational degree matrix; finally, this matrix is used to divide the frames into three parts: key emotional frames, compatible emotional frames, and noise frames. The emotion tags of the compatible emotional frames are learned adaptively and cyclically from the key emotional frames via the relational degree matrix, while the tags assigned to the compatible frames are checked against those of the key frames. Additionally, we introduce a new self-labeling criterion based on fuzzy membership degree for SER. To evaluate the feasibility and effectiveness of the proposed En-DRNN-SR, we conducted extensive experiments on the IEMOCAP, EMODB, and SAVEE databases, where it achieved 69.13%, 82.83%, and 52.31%, respectively, outperforming all competing algorithms. These results demonstrate that the proposed approach surpasses state-of-the-art SER methods in both feature learning and classification.
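The core pipeline stage described above, building a self-representation over frame-level deep features to obtain a relational degree matrix and then splitting the frames into key, compatible, and noise groups, might be prototyped as follows. This is a minimal sketch, not the paper's exact formulation: the ridge-regularised self-representation objective, the symmetrisation of the coefficient matrix, the quantile thresholds, and all function names are our assumptions.

```python
import numpy as np

def self_representation(X, lam=0.1):
    """Approximate each frame's feature vector as a linear combination of the
    other frames: min_C ||X - C X||_F^2 + lam ||C||_F^2 with zero diagonal
    (assumed ridge-regularised self-representation).

    X : (n_frames, d) matrix of deep features, one row per speech frame.
    Returns a symmetric nonnegative relational degree matrix W (n_frames, n_frames).
    """
    n = X.shape[0]
    G = X @ X.T                                  # Gram matrix over frames
    # Closed-form ridge solution: C = (G + lam I)^{-1} G
    C = np.linalg.solve(G + lam * np.eye(n), G)
    np.fill_diagonal(C, 0.0)                     # a frame may not represent itself
    W = 0.5 * (np.abs(C) + np.abs(C.T))          # symmetrise into relational degrees
    return W

def partition_frames(W, key_q=0.7, noise_q=0.2):
    """Split frames into key / compatible / noise sets by total relational
    degree (row sums of W); the quantile cut-offs are illustrative choices."""
    degree = W.sum(axis=1)
    hi = np.quantile(degree, key_q)
    lo = np.quantile(degree, noise_q)
    key = np.where(degree >= hi)[0]              # strongly connected frames
    noise = np.where(degree <= lo)[0]            # weakly connected frames
    compatible = np.setdiff1d(np.arange(len(degree)),
                              np.union1d(key, noise))
    return key, compatible, noise
```

Under this sketch, the tags of the compatible frames would then be propagated from the key frames using the rows of `W` as affinity weights; the noise frames are excluded from self-labeling.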