“…Recent deep learning approaches have achieved remarkable performance, including DNNs [16,39,40], highway neural networks [41], deep bi-directional long short-term memory networks (DBLSTM) [42], and sequence-to-sequence models [43,44]. To move beyond parallel training data, new techniques have been proposed to learn the translation between emotional domains with CycleGAN [45,46] and StarGAN [47], to disentangle the emotional elements from speech with auto-encoders [48,49,50,51], and to leverage text-to-speech (TTS) [52,53] or automatic speech recognition (ASR) [54]. Such frameworks generally work well in speaker-dependent tasks.…”