Investigating Objective Intelligibility in Real-Time EMG-to-Speech Conversion

Diener, Lorenz; Schultz, Tanja

doi:10.21437/interspeech.2018-2080

Cited by 7 publications

(7 citation statements)

References 20 publications

“…These low values can only be achieved through direct speech synthesis. In this sense, real-time SSI systems have been developed for sEMG [181], [182], PMA [183] and EMA [27]. There is also the possibility that real-time auditory feedback might enable the brain to assimilate the SSI as if it were the person's own voice, thus enabling the user to adapt her/his own speaking patterns to produce better acoustics.…”

Section: Comparison of the Two SSI Approaches (mentioning)

confidence: 99%

“…Direct speech synthesis from EMG signals has also progressed considerably in recent years (see [31], [99], [181], [182]), following advances in array sEMG sensors and deep learning. As mentioned above, a particular advantage of EMG with respect to other techniques for articulator motion capture is that EMG signals can be sensed ~60 ms before the actual movements of the articulators.…”

Section: SSIs Based on EMG Signals (mentioning)

confidence: 99%

“…As mentioned above, a particular advantage of EMG with respect to other techniques for articulator motion capture is that EMG signals can be sensed ~60 ms before the actual movements of the articulators. This rapidity facilitates the development of real-time direct synthesis systems with low latency [181], [182], so that the delay between the articulatory gestures and the synthesised acoustic feedback is minimal. In [182], a comprehensive study was carried out in which the influence of various system parameters (DNN size, amount of training data, frame shift, etc.)…”

Section: SSIs Based on EMG Signals (mentioning)

confidence: 99%

“…This rapidity facilitates the development of real-time direct synthesis systems with low latency [181], [182], so that the delay between the articulatory gestures and the synthesised acoustic feedback is minimal. In [182], a comprehensive study was carried out in which the influence of various system parameters (DNN size, amount of training data, frame shift, etc.) on the speech quality generated by a real-time direct synthesis system was analysed using objective quality metrics.…”

Section: ) Ssis Based On Emg Signalsmentioning

confidence: 99%

Silent Speech Interfaces for Speech Restoration: A Review

et al. 2020

This review summarises the status of silent speech interface (SSI) research. SSIs rely on non-acoustic biosignals generated by the human body during speech production to enable communication whenever normal verbal communication is not possible or not desirable. In this review, we focus on the first case and present latest SSI research aimed at providing new alternative and augmentative communication methods for persons with severe speech disorders. SSIs can employ a variety of biosignals to enable silent communication, such as electrophysiological recordings of neural activity, electromyographic (EMG) recordings of vocal tract movements or the direct tracking of articulator movements using imaging techniques. Depending on the disorder, some sensing techniques may be better suited than others to capture speech-related information. For instance, EMG and imaging techniques are well suited for laryngectomised patients, whose vocal tract remains almost intact but are unable to speak after the removal of the vocal folds, but fail for severely paralysed individuals. From the biosignals, SSIs decode the intended message, using automatic speech recognition or speech synthesis algorithms. Despite considerable advances in recent years, most present-day SSIs have only been validated in laboratory settings for healthy users. Thus, as discussed in this paper, a number of challenges remain to be addressed in future research before SSIs can be promoted to real-world applications. If these issues can be addressed successfully, future SSIs will improve the lives of persons with severe speech impairments by restoring their communication capabilities.

“…extremely noisy environments and/or military situations). For this automatic conversion task, typically electromagnetic articulography (EMA, [3,19,20]), ultrasound tongue imaging (UTI, [4,14,18,28]), permanent magnetic articulography (PMA, [10]), surface Electromyography (sEMG, [6,16,22]), lip video [1,7] and multimodal approaches are used [5]. Current SSI systems mostly apply the "direct synthesis" principle, where speech is generated without an intermediate step, directly from the articulatory data.…”

Section: Introduction (mentioning)

confidence: 99%

Applying DNN Adaptation to Reduce the Session Dependency of Ultrasound Tongue Imaging-based Silent Speech Interfaces

Gosztolya, Grósz, Tóth et al. 2020

ACTA POLYTECH HUNG

Silent Speech Interfaces (SSI) perform articulatory-to-acoustic mapping to convert articulatory movement into synthesized speech. Its main goal is to aid the speech handicapped, or to be used as a part of a communication system operating in silence-required environments or in those with high background noise. Although many previous studies addressed the speaker-dependency of SSI models, session-dependency is also an important issue due to the possible misalignment of the recording equipment. In particular, there are currently no solutions available, in the case of tongue ultrasound recordings. In this study, we investigate the degree of session-dependency of standard feed-forward DNN-based models for ultrasound-based SSI systems. Besides examining the amount of training data required for speech synthesis parameter estimation, we also show that DNN adaptation can be useful for handling session dependency. Our results indicate that by using adaptation, less training data and training time are needed to achieve the same speech quality over training a new DNN from scratch. Our experiments also suggest that the sub-optimal cross-session behavior is caused by the misalignment of the recording equipment, as adapting just the lower, feature extractor layers of the neural network proved to be sufficient, in achieving a comparative level of performance.