Wake Word Detection with Streaming Transformers

Wang, Yiming; Lv, Hang; Povey, Daniel; Xie, Lei; Khudanpur, Sanjeev

doi:10.1109/icassp39728.2021.9414777

Cited by 20 publications

(7 citation statements)

References 28 publications

“…For the training of our mask estimate network, we use the far-field data of the challenge as input and the near-field data as target. IPD features are calculated among three microphone pairs with indexes (1,4), (2,5) and (3,6). The frame length and hop length used in the short-time Fourier transform (STFT) are set to 32 and 16 ms, respectively.…”

Section: Setups of the Video-Assisted MVDR (mentioning)

confidence: 99%
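The feature setup in the statement above can be sketched roughly as follows. This is a minimal illustration, not the cited authors' implementation: the 16 kHz sample rate, the Hann window, the cosine form of the IPD, and the names `stft` and `ipd_features` are all assumptions introduced here. Only the 32 ms frame length, the 16 ms hop length, and the microphone pairs come from the quoted text.

```python
import numpy as np

SAMPLE_RATE = 16_000                      # assumption: not stated in the quoted passage
FRAME_LEN = SAMPLE_RATE * 32 // 1000      # 32 ms frame -> 512 samples at 16 kHz
HOP_LEN = SAMPLE_RATE * 16 // 1000        # 16 ms hop   -> 256 samples at 16 kHz
MIC_PAIRS = [(1, 4), (2, 5), (3, 6)]      # 1-based microphone indexes, as quoted


def stft(signal: np.ndarray) -> np.ndarray:
    """Return a (frames, bins) complex STFT of a 1-D signal (Hann window assumed)."""
    n_frames = 1 + (len(signal) - FRAME_LEN) // HOP_LEN
    window = np.hanning(FRAME_LEN)
    frames = np.stack([
        signal[i * HOP_LEN: i * HOP_LEN + FRAME_LEN] * window
        for i in range(n_frames)
    ])
    return np.fft.rfft(frames, axis=-1)


def ipd_features(multichannel: np.ndarray) -> np.ndarray:
    """Compute cos(IPD) for each microphone pair.

    multichannel: array of shape (channels, samples) with at least 6 channels.
    Returns an array of shape (pairs, frames, bins).
    """
    phase = np.angle(np.stack([stft(ch) for ch in multichannel]))
    # Inter-channel phase difference, mapped through cos() to avoid phase wrapping.
    return np.stack([np.cos(phase[a - 1] - phase[b - 1]) for a, b in MIC_PAIRS])
```

With one second of six-channel audio at the assumed sample rate, `ipd_features` returns one feature map per microphone pair, each with 257 frequency bins per frame.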

“…Therefore, the prediction accuracy of KWS strongly impacts the user experience of voice assistants. Recent works on KWS have gained tremendous success and the KWS systems based on audio modality usually perform well under clean-speech conditions [2,3,4]. However, their performance may degrade significantly under noisy conditions due to the interference in signal transmission and the complexity of acoustic environment [5,6,7].…”

Section: Introduction (mentioning)

confidence: 99%

VE-KWS: Visual Modality Enhanced End-to-End Keyword Spotting

Zhang, Wang, Guo et al. 2023

Preprint

The performance of the keyword spotting (KWS) system based on audio modality, commonly measured in false alarms and false rejects, degrades significantly under the far field and noisy conditions. Therefore, audio-visual keyword spotting, which leverages complementary relationships over multiple modalities, has recently gained much attention. However, current studies mainly focus on combining the exclusively learned representations of different modalities, instead of exploring the modal relationships during each respective modeling. In this paper, we propose a novel visual modality enhanced end-to-end KWS framework (VE-KWS), which fuses audio and visual modalities from two aspects. The first one is utilizing the speaker location information obtained from the lip region in videos to assist the training of multi-channel audio beamformer. By involving the beamformer as an audio enhancement module, the acoustic distortions, caused by the far field or noisy environments, could be significantly suppressed. The other one is conducting cross-attention between different modalities to capture the inter-modal relationships and help the representation learning of each modality. Experiments on the MISP challenge corpus show that our proposed model achieves 2.79% false rejection rate and 2.95% false alarm rate on the Eval set, resulting in a new SOTA performance compared with the top-ranking systems in the ICASSP2022 MISP challenge.
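For readers unfamiliar with the two metrics reported in this abstract, the sketch below shows one common way to compute a false rejection rate and a false alarm rate from per-sample labels. The per-sample definitions and the function name `frr_far` are assumptions made for illustration; the abstract does not give its exact scoring protocol, and some systems instead report false alarms per hour of audio.

```python
from typing import Sequence, Tuple


def frr_far(labels: Sequence[int], predictions: Sequence[int]) -> Tuple[float, float]:
    """Return (false rejection rate, false alarm rate).

    labels/predictions hold 0 or 1, where 1 means the keyword is
    present (labels) or was detected (predictions).
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative sample")
    # A false reject is a keyword sample that the system missed.
    false_rejects = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    # A false alarm is a non-keyword sample that triggered a detection.
    false_alarms = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    return false_rejects / positives, false_alarms / negatives
```

Under these definitions, the two rates are normalised by different totals: rejections by the number of keyword samples, alarms by the number of non-keyword samples.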

“…And the final weights ΘT are used for the new initialization to fine-tune the model. The LTH pruning searches for a low-complexity model from steps (10) to (13).…”

Section: Audio-visual Model Pruning Using LTH-IF (mentioning)

confidence: 99%

“…Arik et al. [11] also applied the convolutional recurrent neural network (CRNN) architecture to single English keyword detection. With the achievements of Transformer [12] in the field of deep learning, several variants of Transformers for wake word detection are explored in [13]. Besides, more efficient networks have been also investigated by leveraging recent advances in differentiable neural architecture search [14].…”

Section: Introduction (mentioning)

confidence: 99%

A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning

Zhou, Du, Yang et al. 2022

Preprint

Audio-only based wake word spotting (WWS) is challenging under noisy conditions due to the environmental interference in signal transmission. In this paper, we investigate designing a compact audio-visual WWS system by utilizing the visual information to alleviate the degradation. Specifically, in order to use visual information, we first encode the detected lips to fixed-size vectors with MobileNet and concatenate them with acoustic features followed by the fusion network for WWS. However, the audio-visual model based on neural network requires a large footprint and a high computational complexity. To meet the application requirements, we introduce neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF), to the single-modal and multi-modal models, respectively. Tested on our in-house corpus for audio-visual WWS in a home TV scene, the proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions. Moreover, LTH-IF pruning can largely reduce the network parameters and computations with no degradation of WWS performance, leading to a potential product solution for the TV wake-up scenario.

“…The decoder still has vanilla SA and cross attention layers since the decoder is not used during inference. While many of the streaming SA models use relative position encoding [22][24] [25], we use an absolute position encoding scheme from the original TF [21].…”

Section: Streaming SA Layers (mentioning)

confidence: 99%
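The absolute position encoding mentioned in the statement above is presumably the fixed sinusoidal scheme of the original Transformer, though the quoted passage does not say so explicitly. Under that assumption, a minimal sketch looks like this; the function name and argument names are illustrative only.

```python
import numpy as np


def sinusoidal_position_encoding(num_positions: int, d_model: int) -> np.ndarray:
    """Return a (num_positions, d_model) table of sinusoidal position encodings.

    Even feature indexes hold sin(pos / 10000^(2i/d_model)),
    odd feature indexes hold cos(pos / 10000^(2i/d_model)).
    """
    if d_model % 2 != 0:
        raise ValueError("d_model must be even for this sketch")
    positions = np.arange(num_positions)[:, None]     # shape (num_positions, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]     # the values 2i, shape (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    table = np.zeros((num_positions, d_model))
    table[:, 0::2] = np.sin(angles)
    table[:, 1::2] = np.cos(angles)
    return table
```

Because each row depends only on the absolute frame index, a streaming encoder that uses such a table has to keep track of how many frames it has already consumed, which is one practical difference from relative position encodings.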

Streaming Transformer for Hardware Efficient Voice Trigger Detection and False Trigger Mitigation

Garg, Chang, Sigtia et al. 2021

Preprint

We present a unified and hardware efficient architecture for two stage voice trigger detection (VTD) and false trigger mitigation (FTM) tasks. Two stage VTD systems of voice assistants can get falsely activated to audio segments acoustically similar to the trigger phrase of interest. FTM systems cancel such activations by using post trigger audio context. Traditional FTM systems rely on automatic speech recognition lattices which are computationally expensive to obtain on device. We propose a streaming transformer (TF) encoder architecture, which progressively processes incoming audio chunks and maintains audio context to perform both VTD and FTM tasks using only acoustic features. The proposed joint model yields an average 18% relative reduction in false reject rate (FRR) for the VTD task at a given false alarm rate. Moreover, our model suppresses 95% of the false triggers with an additional one second of post-trigger audio. Finally, on-device measurements show 32% reduction in runtime memory and 56% reduction in inference time compared to non-streaming version of the model.
