Despite recent efforts, accuracy in group emotion recognition remains generally low. One reason for this underwhelming performance is the scarcity of labeled data, which, like the approaches in the literature, focuses mainly on still images. In this work, we address this problem by adapting an inflated ResNet-50 pretrained on a similar task, activity recognition, for which large labeled video datasets are available. Audio information is processed by a Bidirectional Long Short-Term Memory (Bi-LSTM) network that receives extracted features. A multimodal approach fuses audio and video information at the score level using a support vector machine classifier. Evaluation with data from the EmotiW 2020 AV Group-Level Emotion sub-challenge shows a final test accuracy of 65.74% for the multimodal approach, approximately 18% higher than the official baseline. The results show that activity recognition pretraining offers performance advantages for group emotion recognition and that audio is essential to improve the accuracy and robustness of video-based recognition.
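To make the idea of score-level fusion concrete, the following is a minimal sketch, not the authors' implementation. It assumes each modality has already produced one score per class, concatenates the two score vectors, and trains a classifier on the result. The three class names, the toy score values, and all function names are hypothetical. A small one-vs-rest linear SVM trained by hinge-loss subgradient descent stands in for whatever SVM configuration was actually used.

```python
import random

# Hypothetical class set and toy per-class scores; none of these values come from the paper.
CLASSES = ["negative", "neutral", "positive"]
TRAIN = [
    ([0.7, 0.2, 0.1], [0.6, 0.3, 0.1], "negative"),
    ([0.5, 0.4, 0.1], [0.8, 0.1, 0.1], "negative"),
    ([0.2, 0.6, 0.2], [0.1, 0.7, 0.2], "neutral"),
    ([0.4, 0.4, 0.2], [0.1, 0.8, 0.1], "neutral"),
    ([0.1, 0.2, 0.7], [0.2, 0.2, 0.6], "positive"),
    ([0.3, 0.3, 0.4], [0.1, 0.1, 0.8], "positive"),
]


def fuse_scores(video_scores, audio_scores):
    """Score-level fusion: concatenate the per-class scores of both modalities."""
    return list(video_scores) + list(audio_scores)


def decision(w, b, x):
    """Signed distance-like score of a linear model."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(X, y, epochs=200, lr=0.1, lam=0.01, seed=0):
    """Binary linear SVM (labels +1/-1) fitted by hinge-loss subgradient descent."""
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * decision(w, b, X[i])
            # L2 regularisation shrinks the weights at every step.
            w = [wj * (1 - lr * lam) for wj in w]
            if margin < 1:  # margin violated: take a hinge-loss step
                w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * y[i]
    return w, b


def train_one_vs_rest(X, labels, classes):
    """Fit one binary SVM per class (that class versus all others)."""
    return {c: train_linear_svm(X, [1 if l == c else -1 for l in labels]) for c in classes}


def predict(models, x):
    """Pick the class whose binary SVM gives the largest decision value."""
    return max(models, key=lambda c: decision(models[c][0], models[c][1], x))


X_train = [fuse_scores(v, a) for v, a, _ in TRAIN]
y_train = [label for _, _, label in TRAIN]
models = train_one_vs_rest(X_train, y_train, CLASSES)

# Fuse the scores of a new (hypothetical) clip and classify it.
fused = fuse_scores([0.4, 0.35, 0.25], [0.1, 0.2, 0.7])
prediction = predict(models, fused)
print(prediction)
```

The fused vector here has six entries, three per modality, so the classifier can learn how much to trust each modality for each class instead of simply averaging the two score vectors.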
With the advent of self-driving cars and the push by large companies into fully driverless transportation services, monitoring passenger behaviour in vehicles is becoming increasingly important for several reasons, such as ensuring safety and comfort. Although several human action recognition (HAR) methods have been proposed, developing a true HAR system remains a very challenging task. If the dataset used to train a model contains only a small number of actors, the model can become biased towards these actors and their unique characteristics. This can cause the model to generalise poorly when confronted with new actors performing the same actions. This limitation is particularly acute when developing models to characterise the activities of vehicle occupants, for which datasets are short and scarce. In this study, we describe and evaluate three different methods that aim to address this actor bias, and we assess their performance in detecting in-vehicle violence. These methods work either by removing actor-specific information from the model's features during training or by using data that is independent of the actor, such as information about body posture. The experimental results show improvements over the baseline model when evaluated with real data.