Abstract-A novel model is presented to learn bimodally informative structures from audio-visual signals. The signal is represented as a sparse sum of audio-visual kernels. Each kernel is a bimodal function consisting of synchronous snippets of an audio waveform and a spatio-temporal visual basis function. To represent an audio-visual signal, the kernels can be positioned independently and arbitrarily in space and time. The proposed algorithm uses unsupervised learning to form dictionaries of bimodal kernels from audio-visual material. The basis functions that emerge during learning capture salient audio-visual data structures. In addition, it is demonstrated that the learned dictionary can be used to locate sources of sound in the movie frame. Specifically, in sequences containing two speakers, the algorithm can robustly localize a speaker even in the presence of severe acoustic and visual distracters.

I. BACKGROUND AND SIGNIFICANCE

To smoothly interact with our environment, we must be able to analyze and understand complex relationships between the inputs to different sensory modalities. Not surprisingly, this behavioral requirement of multimodal processing is reflected by corresponding observations in brain research. A fast-growing body of experimental evidence suggests that different sensory modalities in the brain do not operate in isolation but exhibit interactions at various levels of sensory processing [1]-[8]. The fields of signal processing and computer vision have also recently seen the development of perception-inspired audio-visual fusion algorithms. Examples include methods for speech and speaker recognition [9], speaker detection aided by video [10], [11], audio filtering and separation based on video [12]-[16], and audio-visual sound source localization [17]-[26].

Typically, algorithms for audio-visual fusion exploit synchronous co-occurrences of transient structures in the different modalities. In their pioneering work, Hershey and Movellan [17] localized sound sources in the image frame by computing the correlation between acoustic energy and intensity change in single pixels. Recently, more sophisticated feature representations have been proposed, for example, audio features derived from audio energy
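As a rough illustration of the pixel-wise correlation idea of Hershey and Movellan described above, the following minimal Python sketch correlates per-frame audio energy with per-pixel intensity change. It is written in the spirit of their method, not as their exact formulation; all function and variable names are ours.

import numpy as np

def audio_visual_correlation(frames, audio_energy):
    """Correlate per-pixel intensity change with per-frame audio energy.

    frames: array of shape (T, H, W), grayscale video frames.
    audio_energy: array of shape (T,), acoustic energy per video frame.
    Returns an (H, W) map; large values mark pixels whose intensity
    changes co-occur with sound, a crude sound-localization cue.
    """
    # Magnitude of intensity change between consecutive frames: (T-1, H, W).
    change = np.abs(np.diff(frames.astype(np.float64), axis=0))
    energy = audio_energy[1:].astype(np.float64)  # align with frame differences

    # Zero-mean both signals along time, then compute the normalized
    # correlation coefficient independently at every pixel.
    change -= change.mean(axis=0, keepdims=True)
    energy -= energy.mean()
    num = (change * energy[:, None, None]).sum(axis=0)
    den = np.sqrt((change ** 2).sum(axis=0) * (energy ** 2).sum()) + 1e-12
    return num / den

# Usage sketch with synthetic data: 100 frames of 48x64 video.
rng = np.random.default_rng(0)
frames = rng.random((100, 48, 64))
audio_energy = rng.random(100)
corr_map = audio_visual_correlation(frames, audio_energy)
print(corr_map.shape)  # (48, 64)

The dictionary-learning approach of this paper replaces such fixed per-pixel statistics with adaptive bimodal kernels learned from the data.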
Abstract-Real-world phenomena involve complex interactions between multiple signal modalities. As a consequence, humans constantly integrate perceptions from all their senses in order to enrich their understanding of the surrounding world. This paradigm can also be extremely useful in many signal processing and computer vision problems involving mutually related signals: the simultaneous processing of multimodal data can reveal information that remains hidden when the signals are considered independently. However, in natural multimodal signals, the statistical dependencies between modalities are in general not obvious. Learning fundamental multimodal patterns could offer deep insight into the structure of such signals. In this paper, we present a novel model of multimodal signals based on their sparse decomposition over a dictionary of multimodal structures. An algorithm for iteratively learning multimodal generating functions that can be shifted to all positions in the signal is also proposed. The learning is defined in such a way that it can be accomplished by iteratively solving a generalized eigenvector problem, which makes the algorithm fast, flexible, and free of user-defined parameters. The proposed algorithm is applied to audiovisual sequences, where it is able to discover underlying structures in the data. Detecting such audio-video patterns in audiovisual clips makes it possible to localize the sound source in the video in the presence of substantial acoustic and visual distractors, outperforming state-of-the-art audiovisual localization algorithms.
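One way to write the sparse, shift-invariant decomposition that this abstract describes is sketched below. The notation is ours, introduced only for illustration, and need not match the paper's exact formulation.

% Sketch of the sparse shift-invariant multimodal signal model (our notation).
% The audio-visual signal s has an audio track s^{(a)}(t) and a video track
% s^{(v)}(x, y, t); each dictionary kernel \phi_k pairs an audio snippet with
% a synchronous spatio-temporal visual function.
\begin{align}
  s &\approx \sum_{i=1}^{N} c_i \, T_{(x_i, y_i, t_i)} \phi_{k_i},
  &
  \phi_k &= \bigl( \phi_k^{(a)}(t), \; \phi_k^{(v)}(x, y, t) \bigr),
\end{align}
% where T_{(x_i, y_i, t_i)} shifts the audio component of \phi_{k_i} by t_i
% in time and the visual component by (x_i, y_i) in space and t_i in time,
% the c_i are sparse coefficients, and N is small relative to the signal size.

Because the kernels can be placed independently and arbitrarily in space and time, the same small dictionary can account for audio-visual events occurring anywhere in the sequence.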