Sequence-to-sequence Models for Small-Footprint Keyword Spotting

Zhang, Haitong; Zhang, Junbo; Wang, Yujun

doi:10.48550/arxiv.1811.00348

Cited by 3 publications

(4 citation statements)

References 11 publications


“…A number of works have since used deep architectures suitable for sequence modelling (e.g. RNNs, CNNs, or graph convolutional networks) [6,15,23,31,36,38,45,52,60,64,76], including encoder-decoder approaches [8,51,73,78]. Berg et al [9] recently proposed using a Transformer model for the same task.…”

Section: Related Work (mentioning)

confidence: 99%

Visual Keyword Spotting with Attention

Prajwal, Momeni, Afouras, et al. 2021

Preprint


In this paper, we consider the task of spotting spoken keywords in silent video sequences, also known as visual keyword spotting. To this end, we investigate Transformer-based models that ingest two streams, a visual encoding of the video and a phonetic encoding of the keyword, and output the temporal location of the keyword if present. Our contributions are as follows: (1) We propose a novel architecture, the Transpotter, that uses full cross-modal attention between the visual and phonetic streams; (2) We show through extensive evaluations that our model outperforms the prior state-of-the-art visual keyword spotting and lip reading methods on the challenging LRW, LRS2, and LRS3 datasets by a large margin; (3) We demonstrate the ability of our model to spot words under the extreme conditions of isolated mouthings in sign language videos.



“…In order to generate background noise, we randomly sample and crop background noises provided in the dataset. For a fair comparison, in our test set, the "silence" class test samples are taken from open source speech commands dataset test set version 2 [19], and test samples of other classes are written in the officially released testing.list.…”

Section: Datasets (mentioning)

confidence: 99%

“…On the other hand, Deep neural networks (DNNs) have recently proven to yield efficient small-footprint solutions for KWS [8,9,10,11,12,13,14,15,16]. In particular, more advanced architectures, such as Convolutional Neural Networks (CNNs), have been applied to solve KWS problems under limited memory footprint as well as computational resource scenarios, showing excellent accuracy.…”

Section: Introduction (mentioning)

confidence: 99%

Text Anchor Based Metric Learning for Small-footprint Keyword Spotting

Wang, Gu, Zou 2021

Preprint


In Keyword Spotting (KWS), it remains challenging to balance a small footprint against high accuracy. Recently proposed metric learning approaches improved the generalizability of models for the KWS task, and 1D-CNN based KWS models have achieved state-of-the-art (SOTA) results in terms of model size. However, for metric learning, due to data limitations, the speech anchor is highly susceptible to the acoustic environment and speakers. Also, we note that the 1D-CNN models have limited capability to capture long-term temporal acoustic features. To address the above problems, we propose to utilize text anchors to improve the stability of anchors. Furthermore, a new type of model (LG-Net) is designed to promote long-short term acoustic feature modeling based on 1D-CNN and self-attention. Experiments are conducted on Google Speech Commands Dataset version 1 (GSCDv1) and 2 (GSCDv2). The results demonstrate that the proposed text anchor based metric learning method shows consistent improvements over speech anchor on representative CNN-based models. Moreover, our LG-Net model achieves SOTA accuracy of 97.67% and 96.79% on the two datasets, respectively. It is encouraging to see that our lighter LG-Net with only 74k parameters obtains 96.82% KWS accuracy on GSCDv1 and 95.77% KWS accuracy on GSCDv2.


“…RNNs are also combined with convolutional layers [7,25,27] to simultaneously model local features and temporal dependencies. Recent works also explore seq2seq models for KWS [9,31,45,47].…”

Section: Related Work (mentioning)

confidence: 99%

Seeing wake words: Audio-visual Keyword Spotting

Momeni, Afouras, Stafylakis, et al. 2020

Preprint


The goal of this work is to automatically determine whether and when a word of interest is spoken by a talking face, with or without the audio. We propose a zero-shot method suitable for 'in the wild' videos. Our key contributions are: (1) a novel convolutional architecture, KWS-Net, that uses a similarity map intermediate representation to separate the task into (i) sequence matching, and (ii) pattern detection, to decide whether the word is there and when; (2) we demonstrate that if audio is available, visual keyword spotting improves the performance both for a clean and noisy audio signal. Finally, (3) we show that our method generalises to other languages, specifically French and German, and achieves a comparable performance to English with less language specific data, by fine-tuning the network pre-trained on English. The method exceeds the performance of the previous state-of-the-art visual keyword spotting architecture when trained and tested on the same benchmark, and also that of a state-of-the-art lip reading method.

