The presence of divergent and independent research traditions in the gestural and vocal domains of primate communication has resulted in major discrepancies in the definition and operationalization of cognitive concepts. However, in recent years, accumulating evidence from behavioural and neurobiological research has shown that both human and non‐human primate communication is inherently multimodal. It is therefore timely to integrate the study of gestural and vocal communication. Herein, we review evidence demonstrating that there is no clear difference between primate gestures and vocalizations in the extent to which they show evidence for the presence of key language properties: intentionality, reference, iconicity and turn‐taking. We also find high overlap in the neurobiological mechanisms producing primate gestures and vocalizations, as well as in ontogenetic flexibility. These findings confirm that human language had multimodal origins. Nonetheless, we note that in great apes, gestures seem to fulfil a carrying (i.e. predominantly informative) role in close‐range communication, whereas in human face‐to‐face interactions the vocal stream takes that role. This suggests an evolutionary shift in the carrying role from the gestural to the vocal stream, and we explore this transition in the carrying modality. Finally, we suggest that future studies should focus on the links between complex communication, sociality and cooperative tendency to strengthen the study of language origins.