Progressive Voice Trigger Detection: Accuracy vs Latency

Sigtia, Siddharth; Bridle, John S.; Richards, Hywel; Clark, Pascal; Marchi, Erik; Garg, Vineet

doi:10.48550/arxiv.2010.15446

Cited by 1 publication

(3 citation statements)

References 12 publications

(25 reference statements)


“…1. Block diagrams of (a) conventional multi-task learning for KWS [20,21] and (b) our proposed approach. In the conventional approach, a last layer is simply split into two branches, one for phoneme prediction and one for phrase prediction.…”

Section: Overview (mentioning)

confidence: 99%

“…In the multi-task learning framework, the model is trained using both phonetic loss and phrase loss [3,20,21]. Let us assume that we sample N utterances for a mini-batch from a combined set of an ASR dataset and a KWS dataset.…”

Section: Multi-task Learning (mentioning)

confidence: 99%

“…Recently, multi-task learning has been applied to KWS [3,20,21] to better generalize models leveraging both large ASR and in-domain KWS datasets. In this framework, an output layer of the acoustic model is split into two branches for the two tasks.…”

Section: Introduction (mentioning)

confidence: 99%


Multi-task Learning with Cross Attention for Keyword Spotting

Higuchi, Gupta, Dhir

2021

Preprint


Keyword spotting (KWS) is an important technique for speech applications, which enables users to activate devices by speaking a keyword phrase. Although a phoneme classifier can be used for KWS, exploiting a large amount of transcribed data for automatic speech recognition (ASR), there is a mismatch between the training criterion (phoneme recognition) and the target task (KWS). Recently, multi-task learning has been applied to KWS to exploit both ASR and KWS training data. In this approach, an output of an acoustic model is split into two branches for the two tasks, one for phoneme transcription trained with the ASR data and one for keyword classification trained with the KWS data. In this paper, we introduce a cross attention decoder in the multi-task learning framework. Unlike the conventional multi-task learning approach with the simple split of the output layer, the cross attention decoder summarizes information from a phonetic encoder by performing cross attention between the encoder outputs and a trainable query sequence to predict a confidence score for the KWS task. Experimental results on KWS tasks show that the proposed approach outperformed the conventional multi-task learning with split branches and a bi-directional long short-term memory decoder by 12% on average.
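The cross attention step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes single-head scaled dot-product attention with no learned key/value projections, and the names `cross_attention`, `keyword_confidence`, `out_weights`, and `out_bias`, as well as all sizes and the random initialization, are hypothetical choices made for illustration only.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, encoder_outputs):
    """Single-head scaled dot-product cross attention (illustrative only).

    queries: Q vectors of dimension d (trainable parameters in a real model).
    encoder_outputs: T vectors of dimension d from the phonetic encoder.
    Returns Q summary vectors, each a weighted average of encoder outputs.
    """
    d = len(queries[0])
    summaries = []
    for q in queries:
        # Similarity of this query to every encoder frame.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in encoder_outputs
        ]
        weights = softmax(scores)
        # Attention-weighted average of the encoder frames.
        summary = [
            sum(w * frame[j] for w, frame in zip(weights, encoder_outputs))
            for j in range(d)
        ]
        summaries.append(summary)
    return summaries


def keyword_confidence(queries, encoder_outputs, out_weights, out_bias):
    """Map the attention summaries to a confidence score in (0, 1).

    The linear layer plus sigmoid is an assumed output head, not one
    taken from the paper.
    """
    summaries = cross_attention(queries, encoder_outputs)
    flat = [x for s in summaries for x in s]
    logit = sum(w * x for w, x in zip(out_weights, flat)) + out_bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy sizes (assumptions): 20 encoder frames, 4 queries, dimension 8.
rng = random.Random(0)
T, Q, D = 20, 4, 8
encoder_outputs = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(T)]
queries = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(Q)]
out_weights = [rng.gauss(0, 0.1) for _ in range(Q * D)]
out_bias = 0.0

score = keyword_confidence(queries, encoder_outputs, out_weights, out_bias)
print(f"keyword confidence: {score:.3f}")
```

In a trained system the query vectors and the output head would be learned jointly with the encoder. Here they are random, so the printed score has no meaning beyond showing the data flow from encoder outputs to a single confidence value.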
