With the increasing deployment of autonomous taxis in cities around the world, recent studies have stressed the importance of developing new methods, models and tools for intuitive human–autonomous taxi interactions (HATIs). Street hailing is one example: passengers hail an autonomous taxi by simply waving a hand, exactly as they do for manned taxis. However, automated recognition of taxi street hailing has been explored to a very limited extent. To address this gap, in this paper we propose a new method for detecting taxi street hailing based on computer vision techniques. Our method is inspired by a quantitative study that we conducted with 50 experienced taxi drivers in the city of Tunis (Tunisia) to understand how they recognize street-hailing cases. Based on these interviews, we distinguish between explicit and implicit street hailing. Given a traffic scene, explicit street hailing is detected using three elements of visual information: the hailing gesture, the person's position relative to the road and the person's head orientation. Any person who is standing close to the road, looking towards the taxi and making a hailing gesture is automatically recognized as a taxi-hailing passenger. If some elements of the visual information are not detected, we use contextual information (such as location, time and weather) to evaluate whether an implicit street-hailing case exists. For example, a person who is standing on the roadside in the heat, looking towards the taxi but not waving their hand, is still considered a potential passenger. The method we propose therefore integrates both visual and contextual information in a computer-vision pipeline designed to detect taxi street-hailing cases from video streams collected by capture devices mounted on moving taxis. We tested our pipeline on a dataset that we collected with a taxi on the roads of Tunis. Considering both explicit and implicit hailing scenarios, our method yields satisfactory results in relatively realistic settings, with an accuracy of 80%, a precision of 84% and a recall of 84%.
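To make the decision logic concrete, the sketch below shows one way the visual/contextual fusion described above could be expressed. The abstract does not specify the actual fusion rule, so everything here is an illustrative assumption rather than the paper's implementation: the type names (PersonObservation, SceneContext), the distance threshold, and the contextual weights and score threshold are all hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative thresholds; the paper's abstract does not specify values.
ROAD_DISTANCE_THRESHOLD_M = 2.0   # how close to the road edge counts as "close"
IMPLICIT_SCORE_THRESHOLD = 0.4    # contextual score needed for an implicit hail

@dataclass
class PersonObservation:
    """Per-person visual cues extracted from a traffic-scene frame."""
    distance_to_road_m: float        # estimated distance to the road edge
    facing_taxi: bool                # head orientation towards the taxi
    hailing_gesture: Optional[bool]  # None when the gesture detector is inconclusive

@dataclass
class SceneContext:
    """Contextual cues about the scene (spatial, temporal, weather)."""
    near_pickup_spot: bool   # e.g. person stands at a known waiting area
    late_night: bool
    harsh_weather: bool      # rain, extreme heat, etc.

def classify_hailing(person: PersonObservation, context: SceneContext) -> str:
    """Return 'explicit', 'implicit', or 'none' for one detected person."""
    close_to_road = person.distance_to_road_m <= ROAD_DISTANCE_THRESHOLD_M

    # Explicit case: all three visual cues are present.
    if close_to_road and person.facing_taxi and person.hailing_gesture:
        return "explicit"

    # Implicit case: the gesture cue is missing or inconclusive, so fall back
    # to a simple weighted contextual score (weights are assumptions).
    if close_to_road and person.facing_taxi:
        score = (0.4 * context.near_pickup_spot
                 + 0.2 * context.late_night
                 + 0.4 * context.harsh_weather)
        if score >= IMPLICIT_SCORE_THRESHOLD:
            return "implicit"

    return "none"

# The abstract's example: standing close to the road in the heat, looking
# towards the taxi, with no detectable wave -> an implicit (potential) hail.
person = PersonObservation(distance_to_road_m=1.2, facing_taxi=True, hailing_gesture=None)
context = SceneContext(near_pickup_spot=False, late_night=False, harsh_weather=True)
print(classify_hailing(person, context))  # prints "implicit"
```

Under these assumed weights, a single strong contextual cue (such as harsh weather) is enough to flag an implicit hail once the positional and head-orientation cues hold, which mirrors the roadside-in-the-heat example from the abstract.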