The life and social sciences often focus on the social nature of music (and of language). In biology, for example, the three main evolutionary hypotheses about music (i.e., sexual selection, parent-infant bond, and group cohesion) stress its intrinsically social character (Honing et al., 2015). Accordingly, neurobiology has investigated the neuronal and hormonal underpinnings of musicality for more than two decades (Chanda and Levitin, 2013; Salimpoor et al., 2015; Mehr et al., 2019). In line with these approaches, the present paper suggests that the proper way to capture the socially interactive nature of music (and, before it, musicality) is to conceive of it as an embodied language, rooted in culturally adapted brain structures (Clarke et al., 2015; D’Ausilio et al., 2015). This proposal heeds Ian Cross’ call for an investigation of music as an “interactive communicative process” rather than “a manifestation of patterns in sound” (Cross, 2014), with an emphasis on its embodied and predictive (coding) aspects (Clark, 2016; Leman, 2016; Koelsch et al., 2019). Our goals in the present paper are twofold: (i) to propose a framework of music as embodied language based on a review of the major concepts that define joint musical action, with a particular emphasis on embodied music cognition and predictive processing, along with some relevant neural underpinnings; and (ii) to summarize three recently published experiments conducted in our laboratories, which provide evidence for, and can be interpreted according to, the new conceptual framework. In doing so, we draw on both cognitive musicology and neuroscience to outline a comprehensive framework of musical interaction, exploring several aspects of making music in dyads, from a very basic proto-musical action, such as tapping, to more sophisticated contexts, such as playing a jazz standard and singing a hocket melody.
Our framework combines embodied and predictive features, revolving around the concept of joint agency (Pacherie, 2012; Keller et al., 2016; Bolt and Loehr, 2017). If social interaction is the “default mode” by which human brains communicate with their environment (Hari et al., 2015), music and musicality conceived of as an embodied language may arguably provide a route toward its navigation.