VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition

Wang, Quan; López-Moreno, Ignacio; Sağlam, Mert; Wilson, Kevin; Chiao, Alan; Liu, Renjie; He, You; Li, Wei; Pelecanos, Jason; Nika, Marily; Gruenstein, Alexander

doi:10.21437/interspeech.2020-1193

Cited by 54 publications

(39 citation statements)

References 0 publications

“…In this work, we propose a switching method between observed mixture and enhanced speech for overlapping speech. Similarly, a preceding work called Voice Filter Light [19] switched observed mixture and enhanced speech to improve ASR results.…”

Section: Related Work (mentioning)

confidence: 99%

Should We Always Separate?: Switching Between Enhanced and Observed Signals for Overlapping Speech Recognition

et al. 2021

Although recent advances in deep learning technology improved automatic speech recognition (ASR), it remains difficult to recognize speech when it overlaps other people's voices. Speech separation or extraction is often used as a front-end to ASR to handle such overlapping speech. However, deep neural network-based speech enhancement can generate 'processing artifacts' as a side effect of the enhancement, which degrades ASR performance. For example, it is well known that single-channel noise reduction for non-speech noise (non-overlapping speech) often does not improve ASR. Likewise, the processing artifacts may also be detrimental to ASR in some conditions when processing overlapping speech with a separation/extraction method, although it is usually believed that separation/extraction improves ASR. In order to answer the question 'Do we always have to separate/extract speech from mixtures?', we analyze ASR performance on observed and enhanced speech at various noise and interference conditions, and show that speech enhancement degrades ASR under some conditions even for overlapping speech. Based on these findings, we propose a simple switching algorithm between observed and enhanced speech based on the estimated signal-to-interference ratio and signal-to-noise ratio. We demonstrated experimentally that such a simple switching mechanism can improve recognition performance when processing artifacts are detrimental to ASR.

“…This helps reduce computational cost and energy consumption, particularly in scenarios where a keyword detector is not preferable. VoiceFilter-Lite [17] is a single-channel source separation model that runs on-device to preserve only the speech signals from a target user as part of a streaming speech recognition system. Similarly, Xue et al. in [18] propose a method called speaker tracing buffer, which can track speaker information consistently across the chunk by extending a self-attention mechanism to maintain the speaker permutation information determined in previous chunks.…”

Section: Speech Separation and Discretization (mentioning)

confidence: 99%

Configurable Privacy-Preserving Automatic Speech Recognition

Aloufi, Haddadi, Boyle

2021

Interspeech 2021

Voice assistive technologies have given rise to far-reaching privacy and security concerns. In this paper we investigate whether modular automatic speech recognition (ASR) can improve privacy in voice assistive systems by combining independently trained separation, recognition, and discretization modules to design configurable privacy-preserving ASR systems. We evaluate privacy concerns and the effects of applying various state-of-the-art techniques at each stage of the system, and report results using task-specific metrics (i.e. WER, ABX, and accuracy). We show that overlapping speech inputs to ASR systems present further privacy concerns, and how these may be mitigated using speech separation and optimization techniques. Our discretization module is shown to minimize paralinguistics privacy leakage from ASR acoustic models to levels commensurate with random guessing. We show that voice privacy can be configurable, and argue this presents new opportunities for privacy-preserving applications incorporating ASR.

“…Moreover, the non-causal nature of the convolutions and the bidirectional recurrent units makes these aforementioned approaches unsuitable for real-time, low-complexity applications. Recently, Voicefilter-lite, a real-time alternative to the Voicefilter has been proposed [16] to improve the performance of speech recognition systems in multi-talker situations. Although Voicefilter-lite showed impressive performance for overlapped speech recognition, it was not designed to improve human perception or intelligibility under such conditions, which is the need of the hour for real-time audio communication systems.…”

Section: Introduction (mentioning)

confidence: 99%

Personalized PercepNet: Real-time, Low-complexity Target Voice Separation and Enhancement

Giri, Venkataramani, Valin

et al. 2021

Preprint

The presence of multiple talkers in the surrounding environment poses a difficult challenge for real-time speech communication systems considering the constraints on network size and complexity. In this paper, we present Personalized PercepNet, a real-time speech enhancement model that separates a target speaker from a noisy multi-talker mixture without compromising on complexity of the recently proposed PercepNet. To enable speaker-dependent speech enhancement, we first show how we can train a perceptually motivated speaker embedder network to produce a representative embedding vector for the given speaker. Personalized PercepNet uses the target speaker embedding as additional information to pick out and enhance only the target speaker while suppressing all other competing sounds. Our experiments show that the proposed model significantly outperforms PercepNet and other baselines, both in terms of objective speech enhancement metrics and human opinion scores.
