ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021
DOI: 10.1109/icassp39728.2021.9414777
Wake Word Detection with Streaming Transformers

Abstract: Modern wake word detection systems usually rely on neural networks for acoustic modeling. Transformers have recently shown superior performance over LSTMs and convolutional networks in various sequence modeling tasks thanks to their better temporal modeling power. However, it is not clear whether this advantage still holds for short-range temporal modeling like wake word detection. Besides, the vanilla Transformer is not directly applicable to the task due to its non-streaming nature and the quadratic time and space c…

Cited by 20 publications (7 citation statements). References 28 publications.
“…For the training of our mask estimation network, we use the far-field data of the challenge as input and the near-field data as target. IPD features are calculated among three microphone pairs with indexes (1,4), (2,5) and (3,6). The frame length and hop length used in the short-time Fourier transform (STFT) are set to 32 and 16 ms, respectively.…”
Section: Setups of the Video-Assisted MVDR
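As an illustration of the front-end setup quoted above, here is a minimal NumPy sketch of an STFT with a 32 ms frame and 16 ms hop, plus an inter-channel phase difference (IPD) feature between two channels. The 16 kHz sample rate and Hann window are assumptions not stated in the excerpt, and the function names are hypothetical.

```python
import numpy as np

# Assumed parameters: the excerpt gives only the 32 ms frame and
# 16 ms hop; the 16 kHz sample rate and Hann window are assumptions.
SAMPLE_RATE = 16000
FRAME_LEN = int(0.032 * SAMPLE_RATE)   # 512 samples
HOP_LEN = int(0.016 * SAMPLE_RATE)     # 256 samples

def stft(signal, frame_len=FRAME_LEN, hop_len=HOP_LEN):
    """Short-time Fourier transform with a Hann window."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    frames = np.stack([
        signal[i * hop_len : i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])
    return np.fft.rfft(frames, axis=1)  # shape: (n_frames, frame_len // 2 + 1)

def ipd(spec_a, spec_b):
    """Inter-channel phase difference between two channels' spectrograms."""
    return np.angle(spec_a) - np.angle(spec_b)
```

One second of 16 kHz audio yields 61 frames of 257 frequency bins under these settings; the IPD would be computed for each of the three microphone pairs listed in the excerpt.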
“…Therefore, the prediction accuracy of KWS strongly impacts the user experience of voice assistants. Recent works on KWS have achieved tremendous success, and KWS systems based on the audio modality usually perform well under clean-speech conditions [2,3,4]. However, their performance may degrade significantly under noisy conditions due to interference in signal transmission and the complexity of the acoustic environment [5,6,7].…”
Section: Introduction
“…The final weights Θ_T are used as the new initialization to fine-tune the model. LTH pruning searches for a low-complexity model from steps (10) to (13).…”
Section: Audio-Visual Model Pruning Using LTH-IF
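The excerpt's steps (10) to (13) are not reproduced here, but Lottery-Ticket-Hypothesis (LTH) pruning generally follows a train → prune → rewind loop: train the sparse network, remove the smallest-magnitude surviving weights, then rewind the survivors to their initial values. The sketch below is generic, not the cited paper's exact procedure; `train_fn`, `prune_frac`, and `rounds` are hypothetical stand-ins.

```python
import numpy as np

def lth_iterative_pruning(init_weights, train_fn, prune_frac=0.2, rounds=3):
    """Sketch of LTH-style iterative magnitude pruning.
    `train_fn` stands in for a full training run on the masked weights."""
    mask = np.ones_like(init_weights)
    weights = init_weights.copy()
    for _ in range(rounds):
        trained = train_fn(weights * mask) * mask       # train the sparse net
        k = int(prune_frac * mask.sum())                # weights to drop this round
        if k > 0:
            surviving = np.abs(trained[mask == 1])
            threshold = np.sort(surviving)[k - 1]
            mask[(np.abs(trained) <= threshold) & (mask == 1)] = 0
        weights = init_weights.copy()                   # rewind to initialization
    return mask
```

The returned binary mask identifies the "winning ticket" subnetwork, which is then fine-tuned from the rewound initial weights.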
“…Arik et al. [11] also applied the convolutional recurrent neural network (CRNN) architecture to single English keyword detection. With the achievements of the Transformer [12] in the field of deep learning, several variants of Transformers for wake word detection are explored in [13]. Besides, more efficient networks have also been investigated by leveraging recent advances in differentiable neural architecture search [14].…”
Section: Introduction
“…The decoder still has vanilla SA and cross-attention layers, since the decoder is not used during inference. While many of the streaming SA models use relative position encoding [22][24][25], we use the absolute position encoding scheme from the original TF [21].…”
Section: Streaming SA Layers
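The absolute position encoding of the original Transformer that the excerpt refers to is the sinusoidal scheme of Vaswani et al. (2017), where even dimensions use sine and odd dimensions use cosine at geometrically spaced frequencies. A minimal NumPy sketch (the function name is illustrative):

```python
import numpy as np

def sinusoidal_position_encoding(max_len, d_model):
    """Absolute sinusoidal position encoding from the original Transformer:
    pe[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    pe[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    Assumes d_model is even."""
    positions = np.arange(max_len)[:, None]        # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # (1, d_model // 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```

Because each position's encoding is fixed in advance, it can be added to streaming input frames as they arrive, with no dependence on pairwise offsets as in relative schemes.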