ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2019.8682617

Voice Trigger Detection from LVCSR Hypothesis Lattices Using Bidirectional Lattice Recurrent Neural Networks

Abstract: We propose a method to reduce false voice triggers of a speech-enabled personal assistant by post-processing the hypothesis lattice of a server-side large-vocabulary continuous speech recognizer (LVCSR) via a neural network. We first discuss how an estimate of the posterior probability of the trigger phrase can be obtained from the hypothesis lattice using known techniques to perform detection, then investigate a statistical model that processes the lattice in a more explicitly data-driven, discriminative manner…
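
The abstract describes a discriminative model that consumes the LVCSR hypothesis lattice directly, which the title identifies as a bidirectional lattice recurrent neural network. The sketch below illustrates that general idea only; it is a minimal, hypothetical reconstruction, not the authors' architecture. The class name, the use of GRU cells, the per-arc feature layout, and the mean pooling over incoming arcs and over node states are all illustrative assumptions.

```python
# Minimal sketch (not the paper's exact model): a bidirectional RNN over a
# lattice DAG. Arcs carry feature vectors (e.g., word embedding + scores);
# node states are updated in topological order, and pooled states feed a
# binary classifier for "trigger phrase present".
import torch
import torch.nn as nn


class BiLatticeRNN(nn.Module):
    def __init__(self, arc_dim: int, hidden_dim: int):
        super().__init__()
        self.fwd_cell = nn.GRUCell(arc_dim, hidden_dim)
        self.bwd_cell = nn.GRUCell(arc_dim, hidden_dim)
        self.classifier = nn.Linear(2 * hidden_dim, 1)
        self.hidden_dim = hidden_dim

    def _pass(self, cell, node_order, incoming, arc_feats, num_nodes):
        # incoming[v] lists (u, a): arc a reaches node v from node u,
        # in the direction of this pass.
        h = [torch.zeros(1, self.hidden_dim) for _ in range(num_nodes)]
        for v in node_order:
            preds = incoming.get(v, [])
            if not preds:
                continue  # the start node of the pass keeps the zero state
            # One GRU update per incoming arc, then average the resulting
            # states (a simple pooling choice; others are possible).
            states = [cell(arc_feats[a].unsqueeze(0), h[u]) for u, a in preds]
            h[v] = torch.stack(states, dim=0).mean(dim=0)
        return torch.cat(h, dim=0)  # (num_nodes, hidden_dim)

    def forward(self, arc_feats, arcs, topo_order):
        # arcs: list of (src, dst) node indices, aligned with arc_feats rows.
        num_nodes = len(topo_order)
        fwd_in, bwd_in = {}, {}
        for a, (u, v) in enumerate(arcs):
            fwd_in.setdefault(v, []).append((u, a))
            bwd_in.setdefault(u, []).append((v, a))
        h_fwd = self._pass(self.fwd_cell, topo_order, fwd_in,
                           arc_feats, num_nodes)
        h_bwd = self._pass(self.bwd_cell, list(reversed(topo_order)), bwd_in,
                           arc_feats, num_nodes)
        # Pool node states and score P(trigger phrase | lattice).
        pooled = torch.cat([h_fwd.mean(dim=0), h_bwd.mean(dim=0)])
        return torch.sigmoid(self.classifier(pooled))


# Toy usage: a 4-node diamond lattice with 4 arcs and 16-dim arc features.
model = BiLatticeRNN(arc_dim=16, hidden_dim=32)
arc_feats = torch.randn(4, 16)                  # one feature row per arc
arcs = [(0, 1), (0, 2), (1, 3), (2, 3)]         # (src, dst) node indices
print(model(arc_feats, arcs, topo_order=[0, 1, 2, 3]))
```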

Cited by 10 publications (14 citation statements); References 8 publications

“…Lattice embeddings are obtained by treating the lattice as a graph and processing it using multiple hidden layers of multi-headed self-attention operation. These embeddings have been shown to be highly informative for FTM task [12,1], but they can be obtained only by running full-fledged ASR on the audio which is expensive to be run on-device and invades user privacy in case of a false trigger. Moreover, the LatticeGNN model needs to be retrained if the distribution of the input lattice features changes due to any changes in the acoustic model, language model or the ASR decoding parameters.…”
Section: LatticeGNN FTM and Lattice Embeddings
confidence: 99%
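
The excerpt above describes lattice embeddings obtained by treating the lattice as a graph and stacking multi-headed self-attention layers over it. A minimal sketch of that idea follows, assuming a dense boolean adjacency matrix, self-loops, residual connections, and mean pooling; these choices and the class name are illustrative and need not match the cited LatticeGNN implementation.

```python
# Sketch: stacked multi-headed self-attention over lattice vertices, with
# attention restricted to vertices connected in the lattice graph.
import torch
import torch.nn as nn


class LatticeSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4, num_layers: int = 2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_layers)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(num_layers))

    def forward(self, x, adjacency):
        # x: (1, num_vertices, dim) vertex features from the lattice.
        # adjacency: (num_vertices, num_vertices) bool, True where connected.
        # True entries in attn_mask are positions NOT allowed to attend, so
        # mask out everything except neighbours and the vertex itself.
        mask = ~(adjacency | torch.eye(adjacency.size(0), dtype=torch.bool))
        for attn, norm in zip(self.layers, self.norms):
            out, _ = attn(x, x, x, attn_mask=mask)
            x = norm(x + out)          # residual connection per layer
        return x.mean(dim=1)           # pooled lattice embedding


# Toy usage: 5 lattice vertices with 32-dimensional features.
adj = torch.zeros(5, 5, dtype=torch.bool)
for u, v in [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]:
    adj[u, v] = adj[v, u] = True
emb = LatticeSelfAttention(dim=32)(torch.randn(1, 5, 32), adj)
print(emb.shape)  # torch.Size([1, 32])
```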
“…Although there has been much work on sequential classification [22] and trigger word spotting [23], [24], [25] in the speech recognition literature, there have been few studies of sequential classification of continuous RF data time streams [26], [27]. Moreover, these studies focus exclusively on gross body motion classification.…”
Section: Sequential Classification
confidence: 99%
“…The voice trigger is similar to keyword spotting that detects one or more predefined keywords from a sequence of speech signals [31,32]. It is widely used to wake up personal assistant devices.…”
Section: Conventional Voice Trigger
confidence: 99%
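
The last excerpt relates the voice trigger to keyword spotting: detecting one or more predefined keywords in a stream of speech and waking the device when one fires. A generic decision rule of that kind is sketched below, assuming frame-level keyword posteriors from some acoustic model; the moving-average smoothing, window length, and threshold are illustrative values, not taken from the cited papers.

```python
# Sketch of a generic keyword-spotting decision rule: smooth per-frame
# keyword posteriors over a sliding window and fire when any keyword's
# smoothed score crosses a threshold.
import numpy as np


def detect_keyword(posteriors: np.ndarray, threshold: float = 0.8,
                   window: int = 30) -> bool:
    """posteriors: (num_frames, num_keywords) per-frame keyword posteriors."""
    num_frames = posteriors.shape[0]
    for t in range(num_frames):
        start = max(0, t - window + 1)
        smoothed = posteriors[start:t + 1].mean(axis=0)  # moving average
        if smoothed.max() >= threshold:                  # any keyword fires
            return True
    return False


# Toy usage: 100 frames, 1 keyword, with a burst of high posteriors.
post = np.full((100, 1), 0.05)
post[40:70] = 0.95
print(detect_keyword(post))  # True
```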