Interspeech 2022
DOI: 10.21437/interspeech.2022-10258

CaTT-KWS: A Multi-stage Customized Keyword Spotting Framework based on Cascaded Transducer-Transformer

Abstract: Most of the existing systems designed for keyword spotting (KWS) rely on a predefined set of keyword phrases. However, the ability to recognize customized keywords is crucial for tailoring interactions with intelligent devices. In this paper, we present a novel framework for customized KWS. This framework leverages the hardware-efficient LiCoNet architecture as the encoder, enhanced by a spectral-temporal pooling layer and a hybrid loss function to facilitate effective word embedding learning. The experimental…

Cited by 7 publications (1 citation statement)
References: 0 publications

“…Recently, there has been increasing interest in unifying multi-stage modules into one single model. In this direction, Cascaded Transducer-Transformer (CATT-KWS) uses two-pass models, which unify streaming and non-streaming ASR approaches [19,20], to unify multistage KWS into one model [21]. Specifically, it uses the streaming part, which is originally used to generate streaming hypotheses, as the first-stage model to detect possible keywords, and then uses the non-streaming parts, which are originally used to re-score streaming hypotheses, as the validation stages for further verification of keywords detected in the first stage.…”
Section: Introduction (mentioning)
confidence: 99%
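
The cascade described in the citation statement above (a streaming first stage proposes a keyword, then non-streaming stages verify it) can be sketched roughly as follows. This is only an illustrative control-flow sketch, not the authors' implementation: the function `cascade_spot`, the `Detection` record, the score callables, and the `trigger`, `accept`, and `window` parameters are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

Frame = Sequence[float]  # one acoustic feature vector (assumed representation)


@dataclass
class Detection:
    start: int               # index of the first frame in the verified segment
    end: int                 # index one past the last frame in the segment
    first_pass_score: float  # score from the streaming stage that triggered


def cascade_spot(
    frames: List[Frame],
    streaming_score: Callable[[List[Frame]], float],
    rescorers: Sequence[Callable[[List[Frame]], float]],
    trigger: float = 0.5,   # hypothetical first-stage threshold
    accept: float = 0.5,    # hypothetical verification threshold
    window: int = 50,       # hypothetical number of frames passed to verifiers
) -> Optional[Detection]:
    """Return the first detection that passes every verification stage."""
    for t in range(len(frames)):
        # Stage 1: the streaming model only sees frames up to time t.
        score = streaming_score(frames[: t + 1])
        if score < trigger:
            continue
        # Later stages: non-streaming models re-score the whole candidate
        # segment, and every one of them must accept it.
        start = max(0, t + 1 - window)
        segment = frames[start : t + 1]
        if all(rescore(segment) >= accept for rescore in rescorers):
            return Detection(start, t + 1, score)
    return None
```

Because the verification stages run only when the first stage fires, the more expensive non-streaming models are invoked on a small fraction of the audio; a first-stage trigger that fails verification is simply discarded and streaming continues.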