Conversational emotion recognition (CER) is an important task owing to its applications in human–computer interaction. Existing work treats CER as an utterance-level classification task, overlooking the fact that empathic responses also reflect an understanding of contextual emotion. Previous work has shown that accurately recognizing emotions in the dialogue history helps generate well-fitting responses. In this paper, we investigate whether the converse also holds, that is, whether learning to generate empathic responses can in turn improve emotion recognition. Specifically, we define an auxiliary empathic multiturn dialogue generation (MDG) task to enhance emotion understanding. Correspondingly, we present a Sequence-to-Sequence-based framework that combines CER and MDG through multitask learning to verify the complementarity of the two tasks. First, we use alternating recurrent neural networks to encode the content of historical utterances and to represent the emotional states of multiple parties, which are then used for emotion classification. Second, since most MDG methods ignore the emotional coherence of the dialogue context itself, we apply an affine transformation that fuses the hidden states of content and emotion to initialize the decoder. Finally, at each generation step, an attention mechanism fuses information from the dialogue history to ensure emotional coherence. Our models outperform the state of the art in CER on three widely used emotional dialogue datasets. Further analysis demonstrates the mutual benefit of MDG and CER and the interpretability of empathy. Moreover, our framework is extensible to different encoding strategies and to multimodal fusion. To the best of our knowledge, this is the first work to explore CER from the perspective of empathy through multitask learning with dialogue generation.
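The affine fusion of content and emotion hidden states mentioned above can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the weight matrices `W_c`, `W_e`, the bias `b`, and the function names are hypothetical, and pure-Python linear algebra stands in for a real deep-learning framework.

```python
# Illustrative sketch (not the paper's code): initialize a decoder state s0
# from a content hidden state and an emotion hidden state via an affine map,
#   s0 = W_c @ h_content + W_e @ h_emotion + b
# All parameter names here are assumptions for illustration.

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def affine_fuse(h_content, h_emotion, W_c, W_e, b):
    """Fuse the two hidden states into a single decoder-initial state."""
    c = matvec(W_c, h_content)
    e = matvec(W_e, h_emotion)
    return [ci + ei + bi for ci, ei, bi in zip(c, e, b)]

# Toy 2-dimensional example with hand-picked weights.
W_c = [[1.0, 0.0], [0.0, 1.0]]   # identity: pass content through
W_e = [[0.5, 0.0], [0.0, 0.5]]   # down-weight the emotion state
b = [0.1, 0.1]
s0 = affine_fuse([1.0, 2.0], [2.0, 4.0], W_c, W_e, b)
print(s0)  # → [2.1, 4.1]
```

In a trained model, `W_c`, `W_e`, and `b` would be learned jointly with the rest of the Seq2Seq network, so the decoder starts from a state that already encodes both what was said and how it was felt.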