A robot agent designed to engage in real-world human-robot joint action must be able to understand the social states of the human users it interacts with in order to behave appropriately. In particular, in a dynamic public space, a crucial task for the robot is to determine the needs and intentions of all of the people in the scene, so that it only interacts with people who intend to interact with it. We address the task of estimating the engagement state of customers for a robot bartender based on the data from audiovisual sensors. We begin with an offline experiment using hidden Markov models, confirming that the sensor data contains the information necessary to estimate user state. We then present two strategies for online state estimation: a rule-based classifier based on observed human behaviour in real bars, and a set of supervised classifiers trained on a labelled corpus. These strategies are compared in offline cross-validation, in an online user study, and through validation against a separate test corpus. These studies show that while the trained classifiers are best in a cross-validation setting, the rule-based classifier performs best with novel data; however, all classiThis article integrates and extends the work described in the following conference papers: [17,20,23,24 Bristol Robotics Laboratory, University of the West of England, Bristol, UK fiers also change their estimate too frequently for practical use. To address this issue, we present a final classifier based on Conditional Random Fields: this model has comparable performance on the test data, with increased stability. In summary, though, the rule-based classifier shows competitive performance with the trained classifiers, suggesting that for this task, such a simple model could actually be a preferred option, providing useful online performance while avoiding the implementation and data-scarcity issues involved in using machine learning for this task.