Signers and speakers coordinate a broad range of intentionally expressive actions within the spatiotemporal context of their face-to-face interactions (Parmentier, 1994; Clark, 1996; Johnston, 1996; Kendon, 2004). Varied semiotic repertoires combine in different ways, the details of which are rooted in the interactions occurring in a specific time and place (Goodwin, 2000; Kusters et al., 2017). However, intense focus in linguistics on conventionalized symbolic form/meaning pairings (especially those which are arbitrary) has obscured the importance of other semiotics in face-to-face communication. A consequence is that the communicative practices resulting from diverse ways of being (e.g., deaf, hearing) are not easily united into a global theoretical framework. Here we promote a theory of language that accounts for how diverse humans coordinate their semiotic repertoires in face-to-face communication, bringing together evidence from anthropology, semiotics, gesture studies and linguistics. Our aim is to facilitate direct comparison of different communicative ecologies. We build on Clark’s (1996) theory of language use as ‘actioned’ via three methods of signaling: describing, indicating, and depicting. Each method is fundamentally different to the other, and they can be used alone or in combination with others during the joint creation of multimodal ‘composite utterances’ (Enfield, 2009). We argue that a theory of language must be able to account for all three methods of signaling as they manifest within and across composite utterances. From this perspective, language—and not only language use—can be viewed as intentionally communicative action involving the specific range of semiotic resources available in situated human interactions.