Proceedings of the 2009 International Conference on Multimodal Interfaces (ICMI 2009)
DOI: 10.1145/1647314.1647370

Multi-modal and multi-camera attention in smart environments

Abstract: This paper considers the problem of multi-modal saliency and attention. Saliency is a cue that is often used for directing the attention of a computer vision system, e.g., in smart environments or for robots. Unlike the majority of recent publications on visual/audio saliency, we aim at a well-grounded integration of several modalities. The proposed framework is based on fuzzy aggregations and offers a flexible, plausible, and efficient way of combining multi-modal saliency information. Besides incorporating diff…
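The fuzzy-aggregation idea in the abstract can be sketched as blending a fuzzy AND (min) and a fuzzy OR (max) over per-modality saliency maps. This is a minimal illustration only; the paper's specific aggregation operators are not reproduced here, and `fuzzy_aggregate` is a hypothetical helper:

```python
import numpy as np

def fuzzy_aggregate(saliency_maps, andness=0.5):
    """Fuse per-modality saliency maps (values in [0, 1]) with a simple
    fuzzy aggregation that interpolates between fuzzy AND (min) and
    fuzzy OR (max). andness = 1 demands agreement of all modalities;
    andness = 0 lets any single modality dominate."""
    stack = np.stack([np.clip(m, 0.0, 1.0) for m in saliency_maps])
    fuzzy_and = stack.min(axis=0)
    fuzzy_or = stack.max(axis=0)
    return andness * fuzzy_and + (1.0 - andness) * fuzzy_or

# Toy 2x2 maps for a visual and an audio modality
visual = np.array([[0.9, 0.1], [0.2, 0.8]])
audio = np.array([[0.7, 0.0], [0.1, 0.9]])
fused = fuzzy_aggregate([visual, audio], andness=0.5)
```

Varying `andness` trades off requiring cross-modal agreement against letting one strong modality attract attention on its own.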

Cited by 16 publications (14 citation statements). References 22 publications.
“…For active scene exploration, saliency can be used to steer the sensors towards salient, thus potentially relevant, regions to detect objects of interest (e.g. [8], [10], [12]). Combining these methods, [14] utilized bottom-up attention, stereo vision and SIFT to perform robust and efficient scene analysis on a mobile robot.…”
Section: Related Work
confidence: 99%
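The steering step described in the statement above, pointing sensors at the most salient region, can be sketched as an exhaustive search for the image window with the highest summed saliency. This is a stand-in for illustration (`most_salient_window` is a hypothetical helper); the cited systems use richer attention models:

```python
import numpy as np

def most_salient_window(saliency, win=3):
    """Return the top-left (row, col) of the win x win window with the
    largest summed saliency -- a stand-in for deciding where to point
    a camera (or microphone array) next."""
    h, w = saliency.shape
    best_sum, best_rc = -1.0, (0, 0)
    for r in range(h - win + 1):
        for c in range(w - win + 1):
            s = saliency[r:r + win, c:c + win].sum()
            if s > best_sum:
                best_sum, best_rc = s, (r, c)
    return best_rc

sal = np.zeros((6, 6))
sal[4, 4] = 1.0          # one salient spot in the lower right
target = most_salient_window(sal)
```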
“…[8]) or audio-visually (e.g. [10], [12]) salient regions is a natural and efficient method to detect and focus on interacting persons.…”
Section: Realization
confidence: 99%
“…5.1), this is sufficient. Note that, in a multi-camera setting, a view selection algorithm can be applied choosing the two "best" views according to some global criteria [6]. Aligned trajectory points are then projected to 3D by ray casting.…”
Section: 3D Combination
confidence: 99%
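The quoted pipeline lifts aligned trajectory points to 3D by ray casting from the two selected views. A minimal stand-in, assuming calibrated 3x4 projection matrices and normalized pixel coordinates, is linear (DLT) triangulation; the cited system's exact ray-casting procedure may differ:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation: recover the 3D point observed at
    pixel x1 in view 1 and x2 in view 2, given the 3x4 projection
    matrices P1 and P2 of the two views."""
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The 3D point (homogeneous) is the null vector of A
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

# Two unit cameras: identity intrinsics, second shifted along x
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
point = triangulate(P1, P2, (0.2, 0.4), (0.0, 0.4))
```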
“…In particular, we propose representing a gesture by projection on its principal plane of motion, which we call the action plane. For the acquisition of gesture trajectories, we build upon our previous work on 3D pointing gesture recognition [5] and saliency-based view selection in multi-camera setups [6].…”
Section: Introduction
confidence: 99%
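Projection onto a principal plane of motion, as described above, can be sketched with an SVD of the mean-centered trajectory: the two dominant right singular vectors span the plane of largest motion variance. This is an assumed PCA-style construction for illustration; the cited paper's exact definition of the action plane may differ:

```python
import numpy as np

def action_plane_projection(traj):
    """Project a 3D trajectory (N x 3 array) onto its principal plane
    of motion: the plane spanned by the two dominant principal
    components of the mean-centered points."""
    centered = traj - traj.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    plane_basis = vt[:2]             # two directions of largest variance
    return centered @ plane_basis.T  # N x 2 in-plane coordinates

# A square drawn in the z = 3 plane collapses to 2D without distortion
traj = np.array([[0., 0., 3.], [1., 0., 3.], [0., 1., 3.], [1., 1., 3.]])
flat = action_plane_projection(traj)
```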
“…For this purpose, the identities of the persons in the room have to be determined (see (Salah et al., 2008)) and the audio-visual focus of attention has to be estimated (see (Voit and Stiefelhagen, 2010; Schauerte et al., 2009)), e.g. to present personalized information on the display a person is currently looking at.…”
Section: Introduction
confidence: 99%