Fast and accurate upper-body and head pose estimation is a key task for automatic monitoring of driver attention, a challenging context characterized by severe illumination changes, occlusions and extreme head poses. In this work, we present a new deep learning framework for head localization and pose estimation on depth images. The core of the proposal is a regression neural network, called POSEidon, composed of three independent convolutional nets followed by a fusion layer, specifically conceived to estimate pose from depth data. In addition, to recover the intrinsic value of face appearance for understanding head position and orientation, we propose a new Face-from-Depth model that reconstructs face images from depth maps. Results in face reconstruction are qualitatively impressive. We test the proposed framework on two public datasets, namely Biwi Kinect Head Pose and ICT-3DHP, and on Pandora, a new challenging dataset mainly inspired by the automotive setup. Results show that our method outperforms recent state-of-the-art works, running in real time at more than 30 frames per second.
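As a rough illustration of the multi-branch fusion idea described above, the following sketch wires three small convolutional branches into a shared fusion layer that regresses three head pose angles. PyTorch, the layer sizes, and the branch inputs (depth, Face-from-Depth and motion crops) are assumptions made for clarity, not the authors' exact configuration.

```python
# Minimal sketch of a POSEidon-style three-branch pose regressor.
# Layer sizes and framework (PyTorch) are illustrative assumptions.
import torch
import torch.nn as nn

class Branch(nn.Module):
    """One convolutional branch (e.g. depth, Face-from-Depth, or motion crop)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),          # fixed 4x4 spatial output
        )
        self.fc = nn.Linear(64 * 4 * 4, 128)

    def forward(self, x):
        return self.fc(self.features(x).flatten(1))

class PoseidonLike(nn.Module):
    """Three independent branches fused to regress head pose angles."""
    def __init__(self):
        super().__init__()
        self.branches = nn.ModuleList([Branch() for _ in range(3)])
        self.fusion = nn.Sequential(
            nn.Linear(3 * 128, 128), nn.ReLU(),
            nn.Linear(128, 3),                # yaw, pitch, roll
        )

    def forward(self, depth, face_from_depth, motion):
        feats = [b(x) for b, x in zip(self.branches, (depth, face_from_depth, motion))]
        return self.fusion(torch.cat(feats, dim=1))

# Usage: three 1-channel head crops (sizes are illustrative).
model = PoseidonLike()
x = torch.randn(2, 1, 64, 64)
angles = model(x, x.clone(), x.clone())       # -> tensor of shape (2, 3)
```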
The field of surveillance and forensics research is currently shifting focus and shows an ever-increasing interest in the task of people reidentification: assigning the same identifier to all instances of a particular individual captured in a series of images or videos, even across significant gaps in time or space. People reidentification can be a useful tool for people analysis in security, serving as a data-association method for long-term tracking in surveillance. However, current reidentification techniques present many difficulties and shortcomings; for instance, they rely solely on the exploitation of visual cues such as color, texture, and the object's shape. Despite the many advances in this field, reidentification is still an open problem. This survey addresses the issues and challenging aspects of people reidentification while describing the previously proposed solutions to the encountered problems, beginning with the first attempts based on holistic descriptors and progressing to the more recently adopted 2D and 3D model-based approaches. The survey also includes an exhaustive treatment of all aspects of people reidentification, including available datasets, evaluation metrics, and benchmarking.
Multi-people tracking in an open-world setting requires a special effort in precise detection. Moreover, temporal continuity in the detection phase gains importance when scene clutter introduces the challenging problem of occluded targets. To this end, we propose a deep network architecture that jointly extracts people's body parts and associates them across short temporal spans. Our model explicitly deals with occluded body parts by hallucinating plausible solutions for joints that are not visible. We propose a new end-to-end architecture composed of four branches (visible heatmaps, occluded heatmaps, part affinity fields and temporal affinity fields) fed by a time-linker feature extractor. To overcome the lack of surveillance data with tracking, body-part and occlusion annotations, we created the largest Computer Graphics dataset to date for people tracking in urban scenarios (about 500,000 frames, almost 10 million body poses) by exploiting a photorealistic videogame. Our architecture, trained on virtual data, also exhibits good generalization capabilities on public real tracking benchmarks when image resolution and sharpness are high enough, producing reliable tracklets useful for further batch data association or re-identification modules.
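To make the four-branch design more concrete, here is a hedged sketch of a shared feature extractor over a pair of consecutive frames feeding four convolutional heads. The joint and limb counts, channel sizes and PyTorch implementation are illustrative assumptions, not the paper's actual network.

```python
# Illustrative four-branch head (visible heatmaps, occluded heatmaps,
# part affinity fields, temporal affinity fields) on a shared "time linker"
# feature extractor. All sizes are hypothetical.
import torch
import torch.nn as nn

NUM_JOINTS = 14            # hypothetical number of body parts
NUM_LIMBS = 13             # hypothetical number of limb connections

class TimeLinker(nn.Module):
    """Extracts features from a pair of consecutive RGB frames."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(2 * 3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
        )

    def forward(self, frame_t, frame_t1):
        return self.backbone(torch.cat([frame_t, frame_t1], dim=1))

def head(out_channels):
    """Small per-branch prediction head."""
    return nn.Sequential(nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(128, out_channels, 1))

class FourBranchTracker(nn.Module):
    def __init__(self):
        super().__init__()
        self.time_linker = TimeLinker()
        self.visible = head(NUM_JOINTS)        # visible joint heatmaps
        self.occluded = head(NUM_JOINTS)       # hallucinated occluded joints
        self.paf = head(2 * NUM_LIMBS)         # part affinity fields (x, y)
        self.taf = head(2 * NUM_JOINTS)        # temporal affinity fields (x, y)

    def forward(self, frame_t, frame_t1):
        feats = self.time_linker(frame_t, frame_t1)
        return (self.visible(feats), self.occluded(feats),
                self.paf(feats), self.taf(feats))

# Usage with two consecutive frames of the same clip (shapes are illustrative).
net = FourBranchTracker()
f_t, f_t1 = torch.randn(1, 3, 128, 224), torch.randn(1, 3, 128, 224)
vis, occ, paf, taf = net(f_t, f_t1)
```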
In-house video surveillance can be an excellent support for people with limited autonomy living alone, such as elderly or disabled people. New hardware technologies, and in particular digital cameras, are now affordable and have recently gained credit as tools for (semi-)automatically assuring people's safety. In this paper a multi-camera vision system for detecting and tracking people and recognizing dangerous behaviours and events, such as a fall, is presented. In such a situation a suitable alarm can be sent, e.g. by means of an SMS. A novel technique for warping people's silhouettes is proposed to exchange visual information between partially overlapping cameras whenever a camera handover occurs. Finally, a multi-client and multi-threaded transcoding video server delivers live video streams to operators/remote users so that they can check the validity of a received alarm. Semantic and event-based transcoding algorithms are used to optimize bandwidth usage. A two-room setup has been created in our laboratory to test the performance of the overall system, and some of the obtained results are reported.
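A minimal sketch of how a silhouette might be transferred between partially overlapping views, assuming a ground-plane homography estimated from a few manually matched points with OpenCV; the system's actual warping technique may differ.

```python
# Hedged sketch: warp a binary person silhouette from camera A into
# camera B's view using a planar homography (an assumption made here
# for illustration, not necessarily the paper's exact method).
import cv2
import numpy as np

def estimate_homography(pts_cam_a, pts_cam_b):
    """Estimate the homography mapping camera A image points to camera B."""
    H, _ = cv2.findHomography(np.float32(pts_cam_a), np.float32(pts_cam_b))
    return H

def warp_silhouette(silhouette_a, H, size_b):
    """Project a binary silhouette mask from camera A into camera B's frame."""
    return cv2.warpPerspective(silhouette_a, H, size_b, flags=cv2.INTER_NEAREST)

# Usage: four corresponding ground-plane points visible from both cameras
# (coordinates are toy values).
pts_a = [(100, 400), (520, 410), (480, 150), (130, 160)]
pts_b = [(60, 380), (500, 390), (470, 120), (90, 140)]
H = estimate_homography(pts_a, pts_b)

mask_a = np.zeros((480, 640), np.uint8)
mask_a[200:400, 250:320] = 255                 # toy silhouette in camera A
mask_in_b = warp_silhouette(mask_a, H, (640, 480))
```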