Interspeech 2020
DOI: 10.21437/interspeech.2020-1330

Hybrid Transformer/CTC Networks for Hardware Efficient Voice Triggering

Abstract: We consider the design of two-pass voice trigger detection systems. We focus on the networks in the second pass that are used to re-score candidate segments obtained from the first pass. Our baseline is an acoustic model (AM), with BiLSTM layers, trained by minimizing the CTC loss. We replace the BiLSTM layers with self-attention layers. Results on internal evaluation sets show that self-attention networks yield better accuracy while requiring fewer parameters. We add an autoregressive decoder network on top of …
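The CTC-trained acoustic models described in the abstract produce per-frame label posteriors that are mapped to a label sequence by CTC's collapse rule: merge consecutive repeats, then drop blanks. A minimal sketch of that rule, assuming the standard convention that index 0 is the blank symbol:

```python
def ctc_collapse(path, blank=0):
    """Collapse a frame-level CTC path into a label sequence:
    merge consecutive repeated labels, then drop blank symbols."""
    out, prev = [], None
    for p in path:
        # a label is emitted only when it differs from the previous
        # frame's label and is not the blank symbol
        if p != prev and p != blank:
            out.append(p)
        prev = p
    return out
```

Note that a blank between two identical labels keeps them distinct: `[3, 0, 3]` collapses to `[3, 3]`, while `[3, 3]` collapses to `[3]`.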

Cited by 13 publications (28 citation statements) | References 20 publications
“…p(VoiceTriggerPhoneSeq | audio). Although useful as demonstrated in previous work [14,15], the phonetic branch cannot be used to score the payload, where we are not sure what the phonetic content in the audio will be. The discriminative branch on the other hand is trained to perform sequence classification.…”
Section: Inference
confidence: 99%
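The citation statement above distinguishes two scoring branches: a phonetic branch, which only applies when the expected phone sequence (the trigger phrase) is known, and a discriminative branch, which classifies the whole segment without a transcription. A toy sketch of that contrast — all names here are illustrative, not the paper's API — using greedy per-frame decoding for the phonetic branch and mean pooling for the discriminative one:

```python
def greedy_phone_match(frame_phones, trigger_phones, blank=0):
    """Phonetic branch (sketch): greedy-decode per-frame phone ids with the
    CTC collapse rule and test whether they spell the known trigger phrase.
    This cannot score the payload, whose phone sequence is unknown."""
    decoded, prev = [], None
    for p in frame_phones:
        if p != prev and p != blank:
            decoded.append(p)
        prev = p
    return decoded == list(trigger_phones)

def discriminative_score(frame_scores):
    """Discriminative branch (sketch): sequence classification via mean
    pooling of per-frame trigger scores; needs no phonetic transcription."""
    return sum(frame_scores) / len(frame_scores)
```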
“…The first stage comprises a low-power detector that processes streaming audio and is always on [12,13]. If a detection is made at the first stage, the detector marks the start and end points of the purported keyword segment (Figure 1) and the segment is then re-scored by larger, more complex models [14,15]. Note that this paper is concerned only with the larger models in the second pass.…”
Section: Model
confidence: 99%
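The two-pass flow quoted above — an always-on first pass that marks a candidate segment, and a larger second-pass model that re-scores only that segment — can be sketched as follows. This is a minimal illustration with hypothetical function names and thresholds, not the paper's implementation:

```python
def two_pass_detect(frames, first_pass_score, second_pass_score,
                    t1=0.5, t2=0.8):
    """Two-pass trigger detection sketch: the cheap first pass scores each
    streaming frame; when its score crosses t1 it marks a segment start, and
    when it drops back below t1 the marked segment is re-scored by the
    larger second-pass model against threshold t2."""
    start = None
    for i, frame in enumerate(frames):
        s = first_pass_score(frame)
        if s >= t1 and start is None:
            start = i                        # first pass fires: mark start
        elif s < t1 and start is not None:
            segment = frames[start:i]        # segment end reached
            if second_pass_score(segment) >= t2:
                return (start, i)            # second pass confirms detection
            start = None                     # second pass rejects candidate
    return None
```

The design point is that the expensive model never sees the full audio stream, only the short segments the first pass proposes.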