This article addresses the need to better understand the interactional asymmetries, challenges, and solutions involved in implementing synchronous hybrid language teaching. We investigate video-recorded peer interactions in a higher education language teaching context in which a student uses a telepresence robot, a remotely moveable videoconferencing tool, to participate in small-group task work in L2 English together with students who are physically present in the language classroom. Drawing on multimodal conversation analysis, we examine how the geographically dispersed peer group achieves, maintains, and repairs joint attention on task-relevant learning materials while accomplishing a task, and how this referential interactional work enables their co-operation as a group. Based on the analysis, we argue that in synchronous hybrid learning there is a need to reflexively adjust interactional practices to secure an intersubjective understanding of learning tasks and their progressivity. The findings also suggest that sensory and interactional asymmetries should be taken into account when developing and implementing synchronous hybrid learning environments that aim to provide equal opportunities regardless of participation mode.