The current study had two main goals: First, to replicate the ‘bimodal integration’ effect (i.e. the automatic integration of crossmodal stimuli, namely facial emotions and emotional prosody); and second, to investigate whether this phenomenon facilitates or impairs the intake and retention of unattended verbal content. The study borrowed from previous bimodal integration designs and included a two-alternative forced-choice (2AFC) task, where subjects were instructed to identify the emotion of a face (as either ‘angry’ or ‘happy’) while ignoring a concurrently presented sentence (spoken in an angry, happy, or neutral prosody), after which a surprise recall was administered to investigate effects on semantic content retention. While bimodal integration effects were replicated (i.e. faster and more accurate emotion identification under congruent conditions), congruency effects were not found for semantic recall. Overall, semantic recall was better for trials with emotional (vs. neutral) faces, and worse in trials with happy (vs. angry or neutral) prosody. Taken together, our findings suggest that when individuals focus their attention on evaluation of facial expressions, they implicitly integrate nonverbal emotional vocal cues (i.e. hedonic valence or emotional tone of accompanying sentences), and devote less attention to their semantic content. While the impairing effect of happy prosody on recall may indicate an emotional interference effect, more research is required to uncover potential prosody-specific effects. All supplemental online materials can be found on OSF (https://osf.io/am9p2/).