Humans rapidly detect and interpret sensory signals that carry emotional meaning. Facial expressions, among the most important signifiers of emotion, are processed by a network of brain regions that includes the amygdala and the posterior superior temporal sulcus (pSTS). However, the precise computations these regions perform, and whether emotion-specific representations explain their responses to socially complex, dynamic stimuli, remain contentious. Here we investigated whether representations from artificial neural networks (ANNs) optimized to recognize emotion either from facial expressions alone or from the broader visual context differ in their ability to predict human pSTS and amygdala activity. We found that representations of facial expressions were encoded in the pSTS but not the amygdala, whereas representations related to visual context were encoded in both regions. These findings demonstrate how the pSTS may operate on abstract representations of facial expressions, such as ‘fear’ and ‘joy’, whereas the amygdala encodes the emotional significance of visual information more broadly.
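The comparison described above is the kind of analysis typically carried out with voxelwise encoding models. The sketch below is a minimal illustration under assumptions, not the authors' exact pipeline: two hypothetical ANN feature sets (one from a face-trained network, one from a context-trained network) are used to predict simulated region-of-interest responses with cross-validated ridge regression, and their prediction accuracies are compared. All variable names and the simulated data are illustrative.

```python
# Minimal sketch of a voxelwise encoding-model comparison (assumed approach,
# not the paper's exact method). Two feature matrices stand in for layer
# activations of a face-trained ANN and a context-trained ANN; ridge
# regression predicts simulated ROI responses from each feature set.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

n_trials, n_voxels = 200, 50                           # e.g. stimuli x ROI voxels
face_feats = rng.standard_normal((n_trials, 128))      # hypothetical face-ANN features
context_feats = rng.standard_normal((n_trials, 256))   # hypothetical context-ANN features
# Simulated ROI data: here driven by the context features plus noise.
roi_bold = (context_feats @ rng.standard_normal((256, n_voxels))) * 0.1 \
           + rng.standard_normal((n_trials, n_voxels))

def encoding_accuracy(features, bold, n_splits=5):
    """Mean cross-validated voxelwise correlation between predicted and
    observed responses, using ridge regression with an alpha search."""
    scores = []
    for train, test in KFold(n_splits, shuffle=True, random_state=0).split(features):
        model = RidgeCV(alphas=np.logspace(-2, 4, 13))
        model.fit(features[train], bold[train])
        pred = model.predict(features[test])
        # Correlate prediction and data per voxel, then average across voxels.
        r = [np.corrcoef(pred[:, v], bold[test][:, v])[0, 1]
             for v in range(bold.shape[1])]
        scores.append(np.mean(r))
    return float(np.mean(scores))

print("face-ANN encoding accuracy:   ", encoding_accuracy(face_feats, roi_bold))
print("context-ANN encoding accuracy:", encoding_accuracy(context_feats, roi_bold))
```

In this toy setup the context features predict the simulated responses better, mirroring the kind of contrast reported for the amygdala; running the same comparison per region (pSTS vs. amygdala) is what distinguishes face-specific from broader contextual encoding.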