We propose a hybrid network-based learning framework for speaker-adaptive vocal emotion conversion, tested on three datasets in different languages, namely, EmoDB (German), IITKGP (Telugu), and SAVEE (English). The optimized learning model is distinctive in its ability to synthesize emotional speech of acceptable perceptual quality while preserving speaker characteristics. The multilingual model is particularly beneficial in scenarios where emotional training data from a specific target speaker are scarce. The proposed model uses speaker-normalized mel-generalized cepstral coefficients for spectral training, with data adaptation using seed data from the target speaker. The fundamental frequency (F0) is transformed using a wavelet synchrosqueezed transform prior to mapping to obtain a sharpened time-frequency representation. A feedforward artificial neural network, trained with particle swarm optimization, is used for F0 mapping. In addition, static intensity modification is performed for each test utterance. With this framework, the spectral and pitch-contour variability of emotional expression is captured better than with the other state-of-the-art methods considered in this study. Considering the overall performance across datasets, the proposed framework achieved an average mel-cepstral distortion (MCD) of 4.98 and a root mean square error of F0 (RMSE-F0) of 10.67 in objective evaluations, and an average comparative mean opinion score (CMOS) of 3.57 and a speaker similarity score of 3.70 in subjective evaluations. In particular, the best MCD of 4.09 (EmoDB, happiness) and RMSE-F0 of 9.00 (EmoDB, anger) were obtained, along with a maximum CMOS of 3.7 and a speaker similarity score of 4.6, highlighting the effectiveness of the hybrid network model.
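For concreteness, the two objective measures reported above can be computed as in the following minimal sketch (not the paper's evaluation code). It assumes time-aligned mel-cepstral and F0 sequences for the converted and target utterances; the array names and the random stand-in data are hypothetical.

```python
import numpy as np

def mel_cepstral_distortion(mgc_conv, mgc_ref):
    """Frame-averaged mel-cepstral distortion (dB) between two aligned
    cepstral sequences of shape (frames, dims); the 0th (energy)
    coefficient is conventionally excluded."""
    diff = mgc_conv[:, 1:] - mgc_ref[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))

def rmse_f0(f0_conv, f0_ref):
    """RMSE (Hz) between converted and reference F0 contours,
    computed only over frames voiced in both contours (F0 > 0)."""
    voiced = (f0_conv > 0) & (f0_ref > 0)
    err = f0_conv[voiced] - f0_ref[voiced]
    return float(np.sqrt(np.mean(err ** 2)))

# Illustration with random stand-in features; real use would pass
# time-aligned cepstra and F0 tracks of converted vs. target speech.
rng = np.random.default_rng(0)
mgc_a = rng.normal(size=(200, 25))
mgc_b = mgc_a + 0.05 * rng.normal(size=(200, 25))
f0_a = np.abs(rng.normal(150.0, 20.0, size=200))
f0_b = f0_a + rng.normal(0.0, 5.0, size=200)
print(mel_cepstral_distortion(mgc_a, mgc_b), rmse_f0(f0_a, f0_b))
```

Lower values of both metrics indicate that the converted spectra and pitch contours lie closer to the target emotional speech, which is how the averages of 4.98 (MCD) and 10.67 (RMSE-F0) above should be read.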