CaTT-KWS: A Multi-stage Customized Keyword Spotting Framework based on Cascaded Transducer-Transformer

Yang, Zhanheng; Sun, Sining; Li, Jin; Zhang, Xiaoming; Wang, Xiong; Ma, Lei; Xie, Lihua

doi:10.21437/interspeech.2022-10258

Cited by 7 publications

(1 citation statement)

References 0 publications

“…Recently, there has been increasing interest in unifying multi-stage modules into one single model. In this direction, Cascaded Transducer-Transformer (CATT-KWS) uses two-pass models, which unify streaming and non-streaming ASR approaches [19,20], to unify multistage KWS into one model [21]. Specifically, it uses the streaming part, which is originally used to generate streaming hypotheses, as the first-stage model to detect possible keywords, and then uses the non-streaming parts, which are originally used to re-score streaming hypotheses, as the validation stages for further verification of keywords detected in the first stage.…”

Section: Introduction (mentioning)

confidence: 99%
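
The two-stage flow described in the citation statement above (a streaming first stage proposes keyword candidates, later stages verify them) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function name `cascaded_detect`, the fixed-length sliding window, the per-stage thresholds, and the toy scorers stand in for the actual transducer and transformer components, which the statement does not specify.

```python
from typing import Callable, List, Sequence

# Hypothetical scorer type: maps a window of frames to a keyword score.
Scorer = Callable[[Sequence[float]], float]


def cascaded_detect(
    frames: Sequence[float],
    first_stage: Scorer,
    verifiers: Sequence[Scorer],
    window: int,
    first_threshold: float,
    verify_thresholds: Sequence[float],
) -> List[int]:
    """Return start indices of windows accepted by every stage."""
    accepted: List[int] = []
    for start in range(0, len(frames) - window + 1):
        segment = frames[start:start + window]
        # Stage 1: the cheap streaming pass proposes a candidate.
        if first_stage(segment) < first_threshold:
            continue
        # Later stages: each verifier must confirm the candidate.
        if all(v(segment) >= t for v, t in zip(verifiers, verify_thresholds)):
            accepted.append(start)
    return accepted
```

The point of the cascade is that the verifiers run only on windows the first stage has already flagged, so the more expensive checks are skipped for most of the audio.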

The NPU-ASLP System for The ISCSLP 2022 Magichub Code-Swiching ASR Challenge

Liang, Chen, et al. 2022

2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP)


The recently proposed serialized output training (SOT) simplifies multi-talker automatic speech recognition (ASR) by generating speaker transcriptions separated by a special token. However, frequent speaker changes can make speaker change prediction difficult. To address this, we propose boundary-aware serialized output training (BA-SOT), which explicitly incorporates boundary knowledge into the decoder via a speaker change detection task and boundary constraint loss. We also introduce a two-stage connectionist temporal classification (CTC) strategy that incorporates token-level SOT CTC to restore temporal context information. Besides typical character error rate (CER), we introduce utterance-dependent character error rate (UD-CER) to further measure the precision of speaker change prediction. Compared to original SOT, BA-SOT reduces CER/UD-CER by 5.1%/14.0%, and leveraging a pre-trained ASR model for BA-SOT model initialization further reduces CER/UD-CER by 8.4%/19.9%.
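
As a rough illustration of the serialized target format the abstract mentions (speaker transcriptions separated by a special token), the sketch below builds one training target from several utterances. The token string `<sc>`, the ordering by start time, and the function name `serialize_transcripts` are assumptions for illustration; the abstract only states that a special token separates the transcriptions.

```python
from typing import List, Tuple

# Assumed token name; the abstract only says "a special token".
SPEAKER_CHANGE = "<sc>"


def serialize_transcripts(utterances: List[Tuple[float, str]]) -> str:
    """Join (start_time, text) utterances into one serialized target string."""
    # Assumption: utterances are ordered by start time before joining.
    ordered = sorted(utterances, key=lambda u: u[0])
    return f" {SPEAKER_CHANGE} ".join(text for _, text in ordered)
```

For example, `serialize_transcripts([(1.5, "fine thanks"), (0.0, "how are you")])` yields a single string with the earlier utterance first and the token between the two.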
