We present a camera pointing system controlled by real-time calculation of sound source locations from a microphone array. Traditional audio localization techniques require explicit estimates of the spatial coordinates of each microphone in the array, and driving a camera pointing system with such techniques additionally requires positional information for the camera. Sometimes this positioning can be done by hand, but for large-aperture microphone arrays with many elements this is impractical. We show that in this setting, where elements are placed in an ad-hoc manner, explicitly learning the microphone positions is an unnecessary step. We give a calibration method that instead learns the mapping from time delays between pairs of microphones to the pan and tilt commands that point a PTZ camera at the sound source. This removes the need to explicitly learn the microphone and camera positions. We use this method to calibrate a real-time camera pointing system used by the UCSD interactive display.
We have built a system that engages naive users in an audiovisual interaction with a computer in an unconstrained public space. We combine audio source localization techniques with face detection algorithms to detect and track the user throughout a large lobby. The sensors we use are an ad-hoc microphone array and a PTZ camera. To engage the user, the PTZ camera turns and points at sounds made by people passing by. From this simple pointing of a camera, the user is made aware that the system has acknowledged their presence. To further engage the user, we develop a face classification method that identifies and then greets previously seen users. The user can interact with the system through a simple hot-spot-based gesture interface. To make the user's interactions with the system feel natural, we utilize reconfigurable hardware, achieving a visual response time of less than 100 ms. We rely heavily on machine learning methods to make our system self-calibrating and adaptive.
where c is the speed of sound in the medium. ∆_ij is often called the time delay of arrival (TDOA) between microphones i and j. It is worth noting that if f is the sampling rate being used, then the largest the TDOA can be, in terms of audio samples, is M = ‖m_i − m_j‖_2 f/c. In other words, ∆_ij always lies in the range [−M, M] and in practice can only be estimated to the nearest sample. This observation directly reveals that closely spaced microphones cannot have as wide a range of TDOAs as microphones that are spaced further apart. Placing microphones further
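The bound above is easy to check numerically. The sketch below (not from the chapter; the positions, sampling rate, and speed of sound are assumed example values) computes M = ‖m_i − m_j‖_2 f/c, the largest possible TDOA in samples for one microphone pair:

```python
import numpy as np

def max_tdoa_samples(m_i, m_j, f=16000.0, c=343.0):
    """Largest possible TDOA, in samples, between microphones at
    positions m_i and m_j: M = ||m_i - m_j||_2 * f / c.

    f is the sampling rate in Hz, c the speed of sound in m/s
    (both illustrative defaults, not values from the text).
    """
    m_i = np.asarray(m_i, dtype=float)
    m_j = np.asarray(m_j, dtype=float)
    return np.linalg.norm(m_i - m_j) * f / c

# Two microphones 1 m apart, sampled at 16 kHz:
# M = 1 * 16000 / 343, roughly 46.6 samples, so the integer-sample
# estimate of Delta_ij is confined to lags in about [-46, 46].
M = max_tdoa_samples([0.0, 0.0, 0.0], [1.0, 0.0, 0.0])
```

This also makes the closing observation concrete: halving the spacing halves M, shrinking the set of distinguishable integer time delays for that pair.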