This paper describes a technique which generates speech acoustics from articulator movements. Our motivation is to help people who can no longer speak following laryngectomy, a procedure which is carried out tens of thousands of times per year in the Western world. Our method for sensing articulator movement, Permanent Magnetic Articulography, relies on small, unobtrusive magnets attached to the lips and tongue. Changes in magnetic field caused by magnet movements are sensed and form the input to a process which is trained to estimate speech acoustics. In the experiments reported here, this 'Direct Synthesis' technique is developed for normal speakers, with glued-on magnets, allowing us to train with parallel sensor and acoustic data. We describe three machine learning techniques for this task, based on Gaussian Mixture Models (GMMs), Deep Neural Networks (DNNs) and Recurrent Neural Networks (RNNs). We evaluate our techniques with objective acoustic distortion measures and subjective listening tests over spoken sentences read from novels (the CMU Arctic corpus). Our results show that the best performing technique is a bidirectional RNN (BiRNN), which employs both past and future contexts to predict the acoustics from the sensor data. BiRNNs are not suitable for synthesis in real time, but fixed-lag RNNs give similar results and, because they only look a little way into the future, overcome this problem. Listening tests show that the speech produced by this method has a natural quality which preserves the identity of the speaker. Furthermore, we obtain up to 92% intelligibility on the challenging CMU Arctic material. To our knowledge, these are the best results obtained for a silent-speech system without a restricted vocabulary and with an unobtrusive device that delivers audio in close to real time. This work promises to lead to a technology which will truly give people whose larynx has been removed their voices back.
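To make the fixed-lag idea concrete, the sketch below (PyTorch, not the authors' code) shows a recurrent regressor that maps articulograph sensor frames to acoustic feature frames while using only a few frames of future context. The feature dimensions, layer sizes and lag value are illustrative assumptions, not settings reported in the paper.

```python
# A minimal sketch of a fixed-lag recurrent sensor-to-acoustics regressor.
# All dimensions and the lag are assumed values for illustration only.
import torch
import torch.nn as nn

class FixedLagRNN(nn.Module):
    """Maps a sequence of articulatory sensor frames to acoustic feature frames.

    One simple way to obtain limited look-ahead: run a unidirectional LSTM over
    the whole sequence and delay the output by `lag` frames, so the prediction
    for frame t has seen sensor frames up to t + lag.
    """
    def __init__(self, sensor_dim=9, acoustic_dim=25, hidden=128, lag=5):
        super().__init__()
        self.lag = lag
        self.rnn = nn.LSTM(sensor_dim, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, acoustic_dim)

    def forward(self, sensors):          # sensors: (batch, frames, sensor_dim)
        h, _ = self.rnn(sensors)         # hidden states per frame
        y = self.out(h)                  # (batch, frames, acoustic_dim)
        # Drop the first `lag` outputs: output index t+lag is the estimate
        # for target frame t, giving `lag` frames of future sensor context.
        return y[:, self.lag:, :]

# Example: one utterance of 200 sensor frames -> 195 acoustic frames.
model = FixedLagRNN()
sensors = torch.randn(1, 200, 9)
acoustics = model(sensors)               # shape (1, 195, 25)
```

A bidirectional variant would replace the LSTM with `bidirectional=True` and use the full future context, which matches the BiRNN described above but prevents streaming, real-time use.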
Patients with larynx cancer often lose their voice following total laryngectomy. Current methods for postlaryngectomy voice restoration are all unsatisfactory, for different reasons: the tracheo-oesophageal valve requires frequent replacement due to biofilm growth, oesophageal speech sounds gruff and masculine, the electro-larynx sounds robotic, and both oesophageal speech and the electro-larynx are difficult to master. In this work we investigate an alternative approach to voice restoration in which speech articulator movement is converted into audible speech using a speaker-dependent transformation learned from simultaneous recordings of articulatory and audio signals. To capture articulator movement, small magnets are attached to the speech articulators and the magnetic field generated while the user 'mouths' words is captured by a set of sensors. Parallel data comprising articulatory and acoustic signals recorded before laryngectomy are used to learn the mapping between the articulatory and acoustic domains, which is represented in this work as a mixture of factor analysers. After laryngectomy, the learned transformation is used to restore the patient's voice by transforming the captured articulator movement into an audible speech signal. Results reported for normal speakers show that the proposed system is very promising.
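As a rough illustration of this kind of articulatory-to-acoustic mapping, the sketch below fits a joint Gaussian mixture over paired articulatory and acoustic frames and converts new articulatory frames with an MMSE mapping. It deliberately substitutes a plain full-covariance GMM for the mixture of factor analysers used in the paper, and the feature dimensions and component count are assumed values.

```python
# A minimal sketch (not the authors' implementation) of joint-GMM based
# articulatory-to-acoustic conversion. Dimensions and K are assumptions.
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

ART_DIM, AC_DIM, K = 9, 25, 16   # articulatory dim, acoustic dim, components

def train_joint_gmm(art_frames, ac_frames):
    """Fit a GMM on concatenated [articulatory, acoustic] frames (parallel data)."""
    joint = np.hstack([art_frames, ac_frames])
    return GaussianMixture(n_components=K, covariance_type='full').fit(joint)

def convert(gmm, art_frames):
    """MMSE mapping: E[acoustic | articulatory] under the joint GMM."""
    x = art_frames
    mu_x = gmm.means_[:, :ART_DIM]
    mu_y = gmm.means_[:, ART_DIM:]
    S_xx = gmm.covariances_[:, :ART_DIM, :ART_DIM]
    S_yx = gmm.covariances_[:, ART_DIM:, :ART_DIM]
    # Component responsibilities given only the articulatory frame.
    logp = np.stack([multivariate_normal.logpdf(x, mu_x[k], S_xx[k])
                     for k in range(K)], axis=1) + np.log(gmm.weights_)
    resp = np.exp(logp - logp.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)
    # Mix the per-component conditional means of the acoustic frame.
    y = np.zeros((len(x), AC_DIM))
    for k in range(K):
        cond = mu_y[k] + (x - mu_x[k]) @ np.linalg.solve(S_xx[k], S_yx[k].T)
        y += resp[:, [k]] * cond
    return y
```

In this simplified stand-in, each mixture component contributes a linear regression from articulatory to acoustic features, weighted by how well it explains the observed articulatory frame; a mixture of factor analysers adds low-rank covariance structure but yields the same style of conversion.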