Multimodal speech recognition of a person with articulation disorders using AAM and MAF

Miyamoto, Chikoto; Komai, Yuto; Takiguchi, Tetsuya; Ariki, Yasuo; Li, Ichao

doi:10.1109/mmsp.2010.5662075

Cited by 20 publications

(12 citation statements)

References 10 publications

“…In Ref. [10], we used multiple acoustic frames (MAF) as an acoustic dynamic feature to improve the recognition rate of a person with an articulation disorder, especially in speech recognition using dynamic features only.…”

Section: Related Work (mentioning)

confidence: 99%

Audio-Visual Speech Recognition Using Convolutive Bottleneck Networks for a Person with Severe Hearing Loss

Takashima, Kakihara, Aihara, et al. 2015

IPSJ Transactions on Computer Vision and Applications

Self Cite

In this paper, we propose an audio-visual speech recognition system for a person with an articulation disorder resulting from severe hearing loss. In the case of a person with this type of articulation disorder, the speech style is so different from that of people without hearing loss that a speaker-independent model for unimpaired persons is hardly useful for recognizing it. We investigate in this paper an audio-visual speech recognition system for a person with severe hearing loss in noisy environments, where a robust feature extraction method using a convolutive bottleneck network (CBN) is applied to audio-visual data. We confirmed the effectiveness of this approach through word-recognition experiments in noisy environments, where the CBN-based feature extraction method outperformed the conventional methods.

“…This application can be launched on Android™ cell phones and tablet computers. Here, model training and recognition methods of the recognizer which accepts connected-digit utterances are described: In the visual modality, several features have been proposed; for example, discrete-cosine-transform results and optical flow-based parameters [3] as pixel-based features, alternatively, Active Appearance Model (AAM) parameters [5] as model-based features. These features have been investigated and compared in [13].…”

Section: B. AVASR on Smart Cell Phones (mentioning)

confidence: 99%

“…Face detection, that is realized as a hardware function of the camera, is conducted in each picture to obtain a face region. A monochrome captured image with face detection results are then stored into a visual frame. ii.…”

(Table entries interleaved in the extracted quote, column Name/Issuer: ATR [1], Tokyo Tech [3], Tokyo Tech [16], M2TINIT [9], Tokyo Tech [17], Tokyo Tech [18], Kobe Univ [5], CENSREC-1-AV [15], CENSREC-2-AV [19])

Section: B. AVASR on Smart Cell Phones (mentioning)

confidence: 99%

“…Most multi-modal speech recognition schemes employ visual information: face, mouth or lip images. Audio-Visual ASR (AVASR) has been investigated by many researchers [1], [2], [3], [4], [5], [6]. Today many mobile computers have not only microphone but also embedded camera to capture ones' user; such equipments are often used for video communication applications.…”

Section: Introduction (mentioning)

confidence: 99%

Data collection for mobile audio-visual speech recognition in various environments

Tamura, Seko, Hayamizu 2014

2014 17th Oriental Chapter of the International Committee for the Co-Ordination and Standardization of Speech Databases and Ass

Abstract: This paper introduces our recent activities for audio-visual speech recognition on mobile devices and data collection in various environments. Audio-visual automatic speech recognition is effective in noisy or real conditions to enhance the robustness of speech recognizer and to improve the recognition accuracy. We have developed an audio-visual speech recognition interface for mobile devices. In order to evaluate the recognizer and investigate issues related to audio-visual processing on mobile computers, we collected speech data and lip images of 16 subjects in eight conditions, where there were various audio noises and visual difficulties. Audio-only speech recognition and visual-only lipreading were then conducted. Through these experiments, we found some issues and future works not only for construction of audio-visual database but also for robust audio-visual speech recognition.

I. INTRODUCTION

Recently, a lot of mobile devices such as tablet computers and smart cell phones, have widely spread all over the world. As the technology of Automatic Speech Recognition (ASR) has been developed, nowadays most mobile devices have speech recognizer, since keyboard-based interface is not suitable for such the devices. These devices are often used in noisy conditions or real environments, however, the recognition performance sometimes decreases due to background noises.

In order to overcome the degradation and to investigate noise-robust speech recognition techniques, large-scale speech corpus is essential. Despite there are many speech corpora available, it is still important to collect speech data for mobile devices in real environments; noise-robust speech technologies should be developed and evaluated using the data. In addition, we must take the computational load and real-time processing on the devices into account.

There are several techniques to enhance the robustness of speech recognizer: e.g. beam forming, spectral subtraction, cepstral mean subtraction, and model adaptation. Multi-modal speech recognition, which incorporates speech data and the other information, is one of the methods. Most multi-modal speech recognition schemes employ visual information: face, mouth or lip images. Audio-Visual ASR (AVASR) has been investigated by many researchers [1], [2], [3], [4], [5], [6]. Today many mobile computers have not only microphone but also embedded camera to capture ones' user; such equipments are often used for video communication applications. These microphone and camera also make AVASR available on mobile devices as a noise-robust speech recognizer. So it is expected to realize a mobile AVASR system. There are some databases available for AVASR and the other audio-visual processing, e.g. audio-visual Voice Activity Detection (VAD) [7], [8] and audio-visual speech synthesis or voice conversion [9], [10], [11]; M2TINIT database [9] has been often employed for such the purposes, CENSREC-1-AV and CENSREC-2-AV [12] are the other examples, which includes not only audio-visual speech data but also a recog...

“…In [9], we proposed robust feature extraction based on principal component analysis (PCA) with more stable utterance data instead of DCT. In [10], we used multiple acoustic frames (MAF) as an acoustic dynamic feature to improve the recognition rate of a person with an articulation disorder, especially in speech recognition using dynamic features only. In spite of these efforts, the recognition rate for articulation disorders is still lower than that of physically unimpaired persons.…”

Section: Introduction (mentioning)

confidence: 99%
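The citation statement above mentions multiple acoustic frames (MAF) as a dynamic feature but does not say how the frames are combined. As a rough illustration only, the sketch below shows one common way to build a feature from several neighbouring frames: concatenating each frame with its neighbours. The function name `stack_frames`, the `context` parameter, and the edge-padding rule are assumptions made for this example and are not taken from the cited paper, whose actual construction may differ (for instance, by adding a dimensionality-reduction step).

```python
from typing import List, Sequence


def stack_frames(frames: Sequence[Sequence[float]], context: int = 1) -> List[List[float]]:
    """Concatenate each frame with `context` neighbours on each side.

    Hypothetical helper: edges are padded by repeating the first or last
    frame, so every output vector has (2 * context + 1) * dim values.
    """
    if context < 0:
        raise ValueError("context must be non-negative")
    n = len(frames)
    stacked: List[List[float]] = []
    for t in range(n):
        window: List[float] = []
        for offset in range(-context, context + 1):
            # Clamp the index so out-of-range neighbours reuse the edge frame.
            idx = min(max(t + offset, 0), n - 1)
            window.extend(frames[idx])
        stacked.append(window)
    return stacked


# Toy input: three frames of two coefficients each (made-up values).
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
stacked = stack_frames(frames, context=1)
print(stacked[1])  # the middle frame together with both neighbours
```

With `context=1`, each output vector is three times as long as an input frame, so it carries information about how the spectrum changes over time rather than a single snapshot.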

A preliminary demonstration of exemplar-based voice conversion for articulation disorders using an individuality-preserving dictionary

Aihara, Takashima, Takiguchi, et al. 2014

J AUDIO SPEECH MUSIC PROC.

Self Cite

We present in this paper a voice conversion (VC) method for a person with an articulation disorder resulting from athetoid cerebral palsy. The movement of such speakers is limited by their athetoid symptoms, and their consonants are often unstable or unclear, which makes it difficult for them to communicate. In this paper, exemplar-based spectral conversion using nonnegative matrix factorization (NMF) is applied to a voice with an articulation disorder. To preserve the speaker's individuality, we used an individuality-preserving dictionary that is constructed from the source speaker's vowels and target speaker's consonants. Using this dictionary, we can create a natural and clear voice preserving their voice's individuality. Experimental results indicate that the performance of NMF-based VC is considerably better than conventional GMM-based VC.
