Abstract-This paper introduces our recent activities on audio-visual speech recognition for mobile devices and on data collection in various environments. Audio-visual automatic speech recognition is effective in noisy or real-world conditions, enhancing the robustness of a speech recognizer and improving recognition accuracy. We have developed an audio-visual speech recognition interface for mobile devices. In order to evaluate the recognizer and investigate issues related to audio-visual processing on mobile computers, we collected speech data and lip images from 16 subjects under eight conditions involving various acoustic noises and visual difficulties. Audio-only speech recognition and visual-only lipreading experiments were then conducted. Through these experiments, we identified several issues and directions for future work, not only for the construction of audio-visual databases but also for robust audio-visual speech recognition.
I. INTRODUCTION

Recently, mobile devices such as tablet computers and smartphones have spread widely around the world. As Automatic Speech Recognition (ASR) technology has developed, most mobile devices now include a speech recognizer, since a keyboard-based interface is not well suited to such devices. These devices are often used in noisy or real-world environments; however, recognition performance sometimes degrades due to background noise.

In order to overcome this degradation and to investigate noise-robust speech recognition techniques, a large-scale speech corpus is essential. Although many speech corpora are available, it is still important to collect speech data for mobile devices in real environments; noise-robust speech technologies should be developed and evaluated using such data. In addition, the computational load and real-time processing on the devices must be taken into account.

There are several techniques to enhance the robustness of a speech recognizer, e.g. beamforming, spectral subtraction, cepstral mean subtraction, and model adaptation. Multi-modal speech recognition, which combines speech with other sources of information, is another such method. Most multi-modal speech recognition schemes employ visual information: face, mouth, or lip images. Audio-Visual ASR (AVASR) has been investigated by many researchers [1], [2], [3], [4], [5], [6]. Today, many mobile computers have not only a microphone but also an embedded camera facing the user; such equipment is often used for video communication applications. This microphone and camera also make AVASR available on mobile devices as a noise-robust speech recognizer, so a mobile AVASR system is expected to be realized.

There are several databases available for AVASR and other audio-visual processing tasks, e.g. audio-visual Voice Activity Detection (VAD) [7], [8] and audio-visual speech synthesis or voice conversion [9], [10], [11]. The M2TINIT database [9] has often been employed for such purposes; CENSREC-1-AV and CENSREC-2-AV [12] are other examples, which include not only audio-visual speech data but also a recog...