In social interactions, humans can express how they feel both in what they say (verbal) and in how they say it (non‐verbal). Although the decoding of vocal emotion expressions occurs rapidly, accumulating electrophysiological evidence suggests that this process is multilayered and involves temporally and functionally distinct processing steps. Neuroimaging and lesion data confirm that these processing steps, which support emotional speech and language comprehension, are anchored in a functionally differentiated brain network. The present review of emotional speech and language processing discusses concepts and empirical clinical and neuroscientific evidence on the basis of behavioral, event‐related brain potential, and functional magnetic resonance imaging data. Together, these data shape our understanding of how we communicate emotions to others through speech and language, and they lead to a multistep processing model of vocal and visual emotion expressions.