Computational auditory scene analysis - modeling the human ability to organize sound mixtures according to their sources - has experienced a rapid evolution as the simple principles suggested by psychological experiments have turned out to be less than the whole story. Phenomena such as the continuity illusion and phonemic restoration show that the brain is able to use a wide range of knowledge-based contextual constraints when interpreting obscured or complex mixtures. To model such processing, we need architectures that operate by confirming hypotheses about the observations rather than relying on directly-extracted descriptions. One such architecture, the 'prediction-driven' approach, is presented along with results from its initial implementation. This architecture can be extended to take advantage of the high-level knowledge implicit in today's speech recognizers by modifying a recognizer to act as one of the 'component models' which provide the explanations of the signal mixture. Although this adaptation raises a number of issues, a preliminary investigation supports the argument that successful scene analysis must exploit such abstract knowledge at every level.
Introduction

The work described in this paper fits into a kind of evolutionary tale of approaches to sound organization. In the beginning, there was the 'simplistic' or "blank background" view that sound objects somehow defined themselves, and that identifying a single perceptual object was as simple as picking out a figure in a child's coloring book. The experimental stimuli on which so much of our understanding of auditory organization is based - the sinusoids and bandlimited noise bursts of Bregman [1990] and others - echo this approach, since, as presented in soundproof listening booths, they would actually be amenable to such treatment.

The second stage of evolution, which we might call the 'optimistic' or "uniform background" view, emerged from initial efforts to take the insights of experimental results in auditory organization (especially those in [Bregman 1990]) and apply them to real sounds. Unlike sinusoids against a silent background, real sounds contain all kinds of noise and distractions to defeat simple extraction routines, and therefore demand a more sophisticated approach. However, the signal processing community has long been accustomed to dealing with noise, and offers various approaches for making the best possible decisions under some simple, but useful, assumptions. These amount to a kind of template matching: if the form of the target and the interference can be exactly specified, the parameters of the target can be recovered in the mathematically best-possible fashion. The essence of this approach is that we can produce simple definitions of what we are looking for - sinusoids of unknown frequency, or narrowband noise energy - and we can then go through a given signal identifying and extracting just the parts that interest us, ignoring the rest, in analogy to the way a human listener is able to 'screen out' interfering...
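As a rough illustration of this 'uniform background' style of processing (a minimal sketch, not taken from the paper), the code below defines the target simply as "a sinusoid of unknown frequency" and recovers its parameters from a noisy mixture by matching against sinusoidal templates: an FFT peak pick to find the frequency, followed by a least-squares fit for amplitude and phase. The sample rate, target frequency, and noise level are illustrative assumptions.

```python
# Sketch of simple 'extract the target, ignore the rest' processing:
# recover a sinusoid of unknown frequency from a noisy mixture.
import numpy as np

rng = np.random.default_rng(0)

# Synthesize a mixture: one sinusoid buried in broadband noise.
fs = 8000                      # sample rate (Hz), assumed
t = np.arange(fs) / fs         # one second of samples
true_freq, true_amp = 440.0, 1.0
mixture = (true_amp * np.sin(2 * np.pi * true_freq * t)
           + 0.5 * rng.standard_normal(t.size))

# Step 1: candidate frequency = largest spectral peak.
spectrum = np.abs(np.fft.rfft(mixture))
freqs = np.fft.rfftfreq(mixture.size, d=1 / fs)
f_hat = freqs[np.argmax(spectrum)]

# Step 2: least-squares fit of amplitude and phase at that frequency
# (the 'best-possible' recovery under a simple white-noise assumption).
basis = np.column_stack([np.sin(2 * np.pi * f_hat * t),
                         np.cos(2 * np.pi * f_hat * t)])
coef, *_ = np.linalg.lstsq(basis, mixture, rcond=None)
a_hat = np.hypot(*coef)

# Step 3: 'extract' the target and discard everything else as background.
target = basis @ coef
residual = mixture - target

print(f"estimated frequency: {f_hat:.1f} Hz, amplitude: {a_hat:.2f}")
```

The point of the sketch is its fragility: the recovery is only as good as the assumption that the signal really is "one sinusoid plus uniform noise", which is exactly the assumption that real, dense sound mixtures violate.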