The natural ecology of human language is face-to-face interaction comprising the exchange of a plethora of multimodal signals. Trying to understand the psycholinguistic processing of language in its natural niche raises new issues, first and foremost the binding of multiple, temporally offset signals under tight time constraints posed by a turn-taking system. This might be expected to overload and slow our cognitive system, but the reverse is in fact the case. We propose cognitive mechanisms that may explain this phenomenon and call for a multimodal, situated psycholinguistic framework to unravel the full complexities of human language processing.

A Binding Problem at the Core of Language

Language as it is used in its central ecological niche, that is, in face-to-face interaction, is embedded in multimodal displays by both speaker and addressee. This is the niche in which it is learned, in which it evolved, and where the bulk of language usage occurs. Communication in this niche involves a complex orchestration of multiple articulators (see Glossary) and modalities: messages are auditory as well as visual, as they are spread across speech, nonspeech vocalizations, and the head, face, hands, arms, and torso. From the point of view of the recipient, this ought in principle to raise two serious computational challenges. First, not all bodily or facial movements are intended as part of the signal or content; the incidental but irrelevant movements must be set aside (we call this the segregation problem). Second, those that seem to be part of the message have to be paired with their counterparts (as when we say 'There!' and point), and simultaneity alone turns out to be an unreliable cue (this is our binding problem). In this Opinion article, we ask how the multiple signals carried by multiple articulators and in different modalities can be combined rapidly to build the phenomenology of a coherent message in the temporally demanding context of conversational speech.