The efficacy of audio-visual interactions in speech perception stems from two kinds of factors. First, at the information level, there is some "complementarity" of audition and vision: It seems that some speech features, mainly related to manner of articulation, are best transmitted by the audio channel, while others, mostly describing place of articulation, are best transmitted by the video channel. Second, at the information-processing level, there is some "synergy" between audition and vision: In a number of different tasks involving acoustic noise, global audio-visual identification scores are generally greater than both the auditory-alone and the visual-alone scores. Until now, however, these two properties have generally been demonstrated only in rather global terms. In the present work, audio-visual interactions are studied at the feature level for French oral vowels, which contrast three series, namely front unrounded, front rounded, and back rounded vowels. A set of experiments on the auditory, visual, and audio-visual identification of vowels embedded in various amounts of noise demonstrates that complementarity and synergy in bimodal speech hold for a bundle of individual phonetic features describing place contrasts in oral vowels. At the information level (complementarity), in the audio channel the height feature is the most robust, backness the second most robust, and rounding the least robust, while in the video channel rounding is better transmitted than height, and backness is almost invisible. At the information-processing level (synergy), transmitted-information scores show that all individual features are better transmitted with the ear and the eye together than with either sensor alone.
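The transmitted-information scores mentioned above can be made concrete with a short sketch. The snippet below is a minimal illustration, not the authors' code, of a Miller-and-Nicely-style computation: for a given feature, the confusion matrix between presented and identified feature values is treated as a joint distribution, and the mutual information between stimulus and response, normalized by the stimulus entropy, gives the proportion of information transmitted. The 2x2 matrix for the rounding feature used in the example is purely hypothetical.

```python
import numpy as np

def transmitted_information(confusion, normalize=True):
    """Mutual information (bits) between stimulus and response values of a
    feature, estimated from a confusion matrix whose rows are presented
    feature values and whose columns are identified feature values."""
    joint = confusion / confusion.sum()              # joint P(stimulus, response)
    p_s = joint.sum(axis=1, keepdims=True)           # marginal P(stimulus)
    p_r = joint.sum(axis=0, keepdims=True)           # marginal P(response)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = joint * np.log2(joint / (p_s * p_r))
    mi = np.nansum(terms)                            # I(S; R) in bits
    if normalize:
        h_s = -np.nansum(p_s * np.log2(p_s))         # stimulus entropy H(S)
        return mi / h_s                              # relative transmitted information
    return mi

# Hypothetical 2x2 confusion counts for the rounding feature
# (rows: presented unrounded/rounded, columns: identified unrounded/rounded).
conf = np.array([[45.0, 5.0],
                 [8.0, 42.0]])
print(transmitted_information(conf))
```

Computed separately on the auditory-alone, visual-alone, and audio-visual confusion matrices, this score lets the robustness of each place feature be compared across channels, which is how the complementarity and synergy claims above are quantified.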
Audiovisual speech recognition involves fusion of the audio and video sensors for phonetic identification. There are three basic ways to fuse data streams in order to make a decision such as phoneme identification: data-to-decision, decision-to-decision, and data-to-data. This leads to four possible models for audiovisual speech recognition: direct identification in the first case, separate identification in the second, and two variants of the third, early-integration case, namely dominant recoding and motor recoding. However, no systematic comparison of these models is available in the literature. We propose an implementation of these four models and submit them to a benchmark test. To this end, we use a noisy-vowel corpus with two recognition paradigms in which the systems are tested at noise levels higher than those used for learning. In one of these paradigms the signal-to-noise ratio (SNR) value is provided to the recognition systems; in the other it is not. We also introduce a new criterion for evaluating performance, based on transmitted information on individual phonetic features. In light of the compared performance of the four models under the two recognition paradigms, we discuss the advantages and drawbacks of these models, leading to proposals for data representation, fusion architecture, and control of the fusion process through sensor reliability.
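To make the contrast between the fusion architectures more tangible, the sketch below illustrates two of them on precomputed per-vowel scores. It is a minimal illustration, not the authors' implementation: the function names, the toy posteriors, and the reliability-weighting scheme are assumptions. Separate identification fuses the decisions of independent audio and video classifiers (the weight standing in for an SNR-based sensor-reliability estimate), whereas direct identification concatenates the audio and video feature vectors and passes them to a single classifier.

```python
import numpy as np

def separate_identification(p_audio, p_video, audio_weight=0.5):
    """Decision-to-decision fusion: each modality is classified on its own,
    then the per-class posteriors are combined (here, a weighted geometric
    mean) and renormalized."""
    w = audio_weight  # could be driven by an SNR-based reliability estimate
    fused = (p_audio ** w) * (p_video ** (1.0 - w))
    return fused / fused.sum()

def direct_identification(audio_feats, video_feats, classifier):
    """Data-to-decision fusion: audio and video feature vectors are
    concatenated and handed to a single classifier trained on the joint
    representation."""
    joint = np.concatenate([audio_feats, video_feats])
    return classifier(joint)

# Usage with toy posteriors over three vowel classes.
p_a = np.array([0.5, 0.3, 0.2])   # audio-alone posteriors
p_v = np.array([0.2, 0.2, 0.6])   # video-alone posteriors
print(separate_identification(p_a, p_v, audio_weight=0.3))
```

Lowering audio_weight as noise increases is one simple way to implement the control of fusion through sensor reliability discussed above; the dominant-recoding and motor-recoding variants would instead recode one data stream into the representational space of the other (or into a common motor space) before classification.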