The standard referring-expression generation task involves creating stand-alone descriptions intended solely to distinguish a target object from its context. However, when an artificial system refers to objects in the course of interactive, embodied dialogue with a human partner, the setting is very different: references in situated dialogue can draw on aspects of the physical, interactive, and task-level context, and are therefore unlike those found in corpora of stand-alone references. Moreover, the dominant method of evaluating generated references involves measuring corpus similarity; in an interactive context, other extrinsic measures such as task success and user preference are more relevant, and numerous studies have found little or no correlation between such extrinsic metrics and the predictions of commonly used corpus-similarity metrics. To explore these issues, we introduce a humanoid robot designed to co-operate with a human partner on a joint construction task. We then describe the context-sensitive reference-generation algorithm implemented for use on this robot, which was inspired by the referring phenomena found in the Joint Communication Task corpus of human-human joint construction dialogues. The context-sensitive algorithm was evaluated in two user studies comparing it to a baseline algorithm, using a combination of objective performance measures and subjective user-satisfaction scores. In both studies, objective task performance and dialogue quality were found to be the same for both versions of the system; however, in both cases the context-sensitive system scored more highly on subjective measures of interaction quality.

The generation of referring expressions (GRE) is one of the most clearly defined sub-tasks in natural language generation (NLG), and is therefore among the tasks that have received the most attention.
The classic GRE task involves creating an initial, stand-alone description intended solely to distinguish the target from any "distractors" in the area, and the dominant method of evaluating such systems involves measuring the similarity of the generated references to those drawn from a suitable corpus of human-generated descriptions.

In this paper, we consider referring expressions in the context of joint action in a shared workspace, where the dialogue takes place between a human and a humanoid robot. The robot was designed to co-operate with a human partner on a joint construction task, so the referring expressions it generates take account of the discourse and physical context in which it operates, drawing inspiration from the referring phenomena found in a corpus of human-human dialogues in a similar joint construction scenario.

In this interactive, embodied context, the most important measures of success are extrinsic measures such as task success and users' subjective opinions, and it is known that the predictions of typical intrinsic, corpus-based evaluation strategies do not tend to correlate w...
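To make the classic task concrete, the following is a minimal sketch of the widely used incremental approach to generating a distinguishing description (in the spirit of Dale and Reiter's algorithm). It is an illustration only, not the context-sensitive algorithm described in this paper; the attribute names, preference order, and object representations are assumptions chosen for the example.

```python
def incremental_description(target, distractors, preferred_attributes):
    """Select attribute-value pairs from the target, in a fixed preference
    order, until the remaining distractor set is empty (a sketch of the
    classic incremental GRE algorithm)."""
    description = {}
    remaining = list(distractors)
    for attr in preferred_attributes:
        value = target.get(attr)
        if value is None:
            continue
        # Keep this attribute only if it rules out at least one distractor.
        if any(d.get(attr) != value for d in remaining):
            description[attr] = value
            remaining = [d for d in remaining if d.get(attr) == value]
        if not remaining:  # target is now uniquely identified
            break
    return description

# Hypothetical scene: a small red cube among two distractor objects.
target = {"type": "cube", "colour": "red", "size": "small"}
distractors = [
    {"type": "cube", "colour": "blue", "size": "small"},
    {"type": "ball", "colour": "red", "size": "large"},
]
print(incremental_description(target, distractors, ["colour", "type", "size"]))
# → {'colour': 'red', 'type': 'cube'}  ("the red cube")
```

Note that the algorithm operates only on the static scene: it has no notion of discourse history, task state, or the partner's attention, which is precisely the gap the context-sensitive approach described here is intended to address.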