Audiovisual speech recognition involves fusing the audio and video sensors for phonetic identification. There are three basic ways to fuse data streams for making a decision such as phoneme identification: data-to-decision, decision-to-decision, and data-to-data. This leads to four possible models for audiovisual speech recognition: direct identification in the first case, separate identification in the second, and two variants of the third, early-integration case, namely dominant recoding and motor recoding. However, no systematic comparison of these models is available in the literature. We propose an implementation of each of the four models and submit them to a benchmark test. To this end, we use a noisy-vowel corpus and two recognition paradigms in which the systems are tested at noise levels higher than those used for learning. In one paradigm, the signal-to-noise ratio (SNR) is provided to the recognition systems; in the other it is not. We also introduce a new criterion for evaluating performance, based on the information transmitted about individual phonetic features. In light of the compared performance of the four models under the two recognition paradigms, we discuss their advantages and drawbacks, leading to proposals for data representation, fusion architecture, and control of the fusion process through sensor reliability.
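As a rough illustration of the transmitted-information criterion, the sketch below computes the mutual information (in bits) between stimulus and response categories from a confusion matrix, which is the standard way such a measure is estimated per phonetic feature. The function name, the binary-feature framing, and the example counts are hypothetical and not taken from the paper.

```python
import numpy as np

def transmitted_information(confusion):
    """Mutual information (bits) between stimulus and response categories,
    estimated from a stimulus-by-response confusion matrix of counts."""
    joint = confusion / confusion.sum()          # joint P(stimulus, response)
    p_s = joint.sum(axis=1, keepdims=True)       # marginal P(stimulus)
    p_r = joint.sum(axis=0, keepdims=True)       # marginal P(response)
    # Terms with zero joint probability contribute nothing to the sum.
    ratio = np.where(joint > 0, joint / (p_s * p_r), 1.0)
    return float(np.sum(joint * np.log2(ratio)))

# Hypothetical example: confusion counts for one binary phonetic feature
# (e.g. rounding), pooled over a noisy-vowel test set for one fusion model.
conf = np.array([[45.0, 5.0],
                 [8.0, 42.0]])
print(f"Transmitted information: {transmitted_information(conf):.3f} bits")
```

Comparing this quantity across the four fusion models, feature by feature and at each noise level, gives a finer-grained picture than overall identification scores alone.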