Retail shoplifting is one of the most prevalent forms of theft, accounting for over one billion GBP in losses for UK retailers in 2018. An automated approach to detecting behaviours associated with shoplifting in surveillance footage could help reduce these losses. Until recently, most state-of-the-art vision-based approaches to this problem relied heavily on black-box deep learning models. While these models achieve very high accuracy, the lack of insight into how their decisions are made raises concerns about potential bias. This limits retailers' ability to deploy such solutions, as several recent high-profile legal cases have ruled evidence derived from black-box methods inadmissible in court. There is therefore an urgent need for models that achieve high accuracy while providing the necessary transparency. One way to address this is through social signal processing, which adds a layer of understanding to the development of transparent models for this task. To this end, we present a social signal processing model for shoplifting prediction, trained and validated on a novel dataset of manually annotated shoplifting videos. The resulting model provides a high degree of interpretability and achieves accuracy comparable with current state-of-the-art black-box methods.
Human activity recognition has been an open problem in computer vision for almost two decades. Many approaches have been proposed in this time, but few are computationally efficient enough for real-time applications. Recently this has changed, with keypoint-based methods demonstrating high accuracy at low computational cost. These approaches take a given image and return a set of joint locations for each individual in the image. To achieve real-time performance, a sparse representation of these features over a given time window is required for classification. Previous methods have achieved this by using a reduced number of keypoints, but this gives a less robust representation of the individual's body pose and may limit the types of activity that can be detected. We present a novel method for reducing the size of the feature set by calculating the Euclidean distance and direction of keypoint changes across a number of frames. This yields a meaningful representation of the individual's movements over time. We show that this method achieves accuracy on par with current state-of-the-art methods while demonstrating real-time performance.
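The distance-and-direction feature reduction described above could be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the (frames, joints, xy) array layout, and the `stride` parameter are all assumptions. Per joint, the displacement magnitude is the Euclidean distance between its positions in two frames, and the direction is the angle of that displacement vector.

```python
import numpy as np

def motion_features(keypoints, stride=1):
    """Compute per-joint motion features between frames `stride` apart.

    keypoints: array of shape (T, K, 2) holding (x, y) per joint per frame.
    Returns an array of shape (T - stride, K, 2) where the last axis holds
    (Euclidean distance of the keypoint change, direction angle in radians).
    """
    deltas = keypoints[stride:] - keypoints[:-stride]   # (T - stride, K, 2)
    dist = np.linalg.norm(deltas, axis=-1)              # magnitude of change
    angle = np.arctan2(deltas[..., 1], deltas[..., 0])  # direction of change
    return np.stack([dist, angle], axis=-1)
```

A sequence of T frames with K joints thus collapses from T*K*2 raw coordinates to (T - stride)*K*2 motion values, which could then be pooled over the time window before classification.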