Humanoid social robots have an increasingly prominent place in today's world. Their acceptance in social and emotional human-robot interaction (HRI) scenarios depends on their ability to convey well-recognized and believable emotional expressions to their human users. In this article, we incorporate recent findings from psychology, neuroscience, human-computer interaction, and HRI to examine how people recognize and respond to emotions displayed by the body and voice of humanoid robots, with a particular emphasis on the effects of incongruence. In a social HRI laboratory experiment, we investigated two kinds of incongruence: contextual incongruence, in which a robot's reaction is incongruous with the socio-emotional context of the interaction, and cross-modal incongruence, in which an observer receives incongruous emotional information across the auditory (vocal prosody) and visual (whole-body expressions) modalities. Results showed that both contextual incongruence and cross-modal incongruence confused observers and decreased the likelihood that they accurately recognized the emotional expressions of the robot. This, in turn, gives the impression that the robot is unintelligent or unable to express "empathic" behaviour, and it has profoundly harmful effects on the robot's likability and believability. Our findings reinforce the need for careful design of emotional expressions for robots that use several channels to communicate their emotional states clearly and effectively. We offer recommendations regarding design choices and discuss future research areas in multimodal HRI.