Whispered speech is quiet speech produced without vocal fold vibration. One common use of whispered speech is as a technique to help control stuttering. However, whispered speech can be uncomfortable to produce and difficult to understand in everyday communication. To address these problems, we propose a method of low-latency whisper-to-speech voice conversion, which can be useful in the real-life communication of people with disordered speech. As part of our research, we study the impact of streaming Automatic Speech Recognition models on the quality of voice conversion, comparing different streaming models and methods for adapting models to streaming settings, and showing the importance of using such models for low-latency voice conversion.

INDEX TERMS Speech recognition, voice conversion, disordered speech, whisper-to-speech processing.
I. INTRODUCTION

Despite the huge progress in developing speech processing tools for various types of disordered speech, there is still room for improvement. In this research we concentrate on stuttering. Based on our literature review, only a few works address the stuttering problem. These studies cover different aspects such as detecting the stuttering type [1] and recognizing and even synthesizing [2] stuttered speech. In this investigation, however, the focus is on a solution that can partially help to control stuttering. According to [3], one technique that can help to overcome stuttering is whispered speech. But whispered speech lacks naturalness due to the absence of the fundamental frequency (F0). Thus, we aim to create a system capable of transforming whispers into regular speech and to apply this method to real-time processing.

The majority of novel voice conversion (VC) systems adopt the following scheme. The whole system usually consists of three parts: an Automatic Speech Recognition (ASR) encoder for extracting phonetic posteriorgrams (PPGs), a decoder that takes PPG features as input and predicts mel spectrograms of the target audio, and a vocoder for synthesizing audio (a minimal sketch of this scheme follows the contributions list below). The acoustic-phonetic distinctions between whispered and regular speech lead to substantial degradation of ASR systems [4]. However, according to [5], a small set of whispered or pseudo-whispered data used for adaptation brings significant improvements in ASR quality. Thus, a model trained on a large amount of speech can be easily adapted to the whispered domain. Moreover, the recent breakthrough in self-supervised learning (SSL) makes it possible to obtain well-performing ASR models with only a few hours of labeled data [6]. Unfortunately, the design of SSL training makes streaming mode challenging for such models.

This paper proposes the following contributions:
• We demonstrate the ability of a HuBERT [7] model pretrained with SSL to work in streaming mode after an attention context masking or chunk-wise fine-tuning procedure (illustrated by the masking sketch below).
• We show the importance of using a streaming encoder model to improve the quality of low-latency whisper-to-speech VC.
• We propose an online VC system adapted to the whispered speech domain.
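As a rough illustration of the three-stage VC scheme described above, the following PyTorch sketch wires together the ASR encoder, PPG decoder, and vocoder; the class and module names here are our own placeholders, not the implementation of any referenced system.

```python
import torch
import torch.nn as nn

class WhisperToSpeechVC(nn.Module):
    """Sketch of the three-stage VC pipeline:
    ASR encoder -> PPGs -> mel decoder -> vocoder.
    Placeholder composition, not the authors' implementation."""
    def __init__(self, asr_encoder: nn.Module,
                 ppg_decoder: nn.Module, vocoder: nn.Module):
        super().__init__()
        self.asr_encoder = asr_encoder   # extracts phonetic posteriorgrams (PPGs)
        self.ppg_decoder = ppg_decoder   # maps PPGs to target mel spectrograms
        self.vocoder = vocoder           # synthesizes a waveform from the mel

    def forward(self, whisper_audio: torch.Tensor) -> torch.Tensor:
        ppg = self.asr_encoder(whisper_audio)  # PPG features
        mel = self.ppg_decoder(ppg)            # predicted target mel spectrogram
        return self.vocoder(mel)               # converted speech waveform
```

In a streaming setting, the same composition is applied chunk by chunk, which is why the encoder itself must be able to operate with limited right context.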
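The attention context masking named in the first contribution can be pictured as a chunk-wise mask that forbids each frame from attending to future chunks (and optionally bounds the left context). A minimal PyTorch sketch, assuming a boolean convention (True = may attend) and illustrative chunk_size / left_chunks parameters that are not values from the paper:

```python
import torch

def chunked_attention_mask(num_frames: int, chunk_size: int,
                           left_chunks: int = -1) -> torch.Tensor:
    """Boolean (num_frames x num_frames) mask restricting each frame to
    its own chunk and past chunks; left_chunks >= 0 also bounds how far
    back attention may reach. Hypothetical helper for illustration."""
    chunk_id = torch.arange(num_frames) // chunk_size  # chunk index per frame
    q = chunk_id.unsqueeze(1)                          # query chunk ids (rows)
    k = chunk_id.unsqueeze(0)                          # key chunk ids (cols)
    mask = k <= q                                      # no attention to future chunks
    if left_chunks >= 0:
        mask &= k >= (q - left_chunks)                 # bounded left context
    return mask

# During fine-tuning, the mask is applied to the attention logits, e.g.:
# scores = scores.masked_fill(~mask, float('-inf'))
```

Fine-tuning the SSL-pretrained encoder under such a mask exposes it to the limited context it will see at inference time, which is the essence of adapting a full-context model to streaming operation.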