ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9413806
Reducing Spelling Inconsistencies in Code-Switching ASR Using Contextualized CTC Loss

Abstract: Code-Switching (CS) remains a challenge for Automatic Speech Recognition (ASR), especially character-based models. With the combined choice of characters from multiple languages, the outcome from character-based models suffers from phoneme duplication, resulting in language-inconsistent spellings. We propose Contextualized Connectionist Temporal Classification (CCTC) loss to encourage spelling consistencies of a character-based nonautoregressive ASR which allows for faster inference. The model trained by CCTC …

Cited by 5 publications (3 citation statements). References 25 publications.
“…They are required not only to model the acoustic information in the speech signal but also to generate a precise token sequence that corresponds to the speech and is contextually coherent. In recent years, connectionist temporal classification (CTC)-based ASR systems [1] have attracted significant attention since they achieve much faster decoding in a non-autoregressive manner and obtain competitive or even better performance than conventional autoregressive models [2,3,4,5,6,7,8]. To be specific, a standard CTC-based ASR usually consists of a multi-layer Transformer-based acoustic encoder and a classification head built from a few simple feed-forward layers.…”
Section: Introduction (mentioning)
confidence: 99%
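To make the architecture described in this statement concrete, below is a minimal PyTorch sketch of a CTC-based ASR model: a Transformer acoustic encoder followed by a linear classification head, trained with the standard CTC loss and decoded greedily (non-autoregressively) frame by frame. All layer sizes, names, and the dummy batch are illustrative assumptions, not details taken from the cited papers.

import torch
import torch.nn as nn

class CTCASRModel(nn.Module):
    """Transformer acoustic encoder + linear CTC head (illustrative sizes only)."""
    def __init__(self, n_mels=80, d_model=256, n_heads=4, n_layers=6, vocab_size=100):
        super().__init__()
        self.frontend = nn.Linear(n_mels, d_model)              # project features to model dim
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=1024, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)   # multi-layer acoustic encoder
        self.head = nn.Linear(d_model, vocab_size)              # per-frame character logits

    def forward(self, feats):                                   # feats: (batch, time, n_mels)
        h = self.encoder(self.frontend(feats))
        return self.head(h).log_softmax(dim=-1)                 # (batch, time, vocab)

model = CTCASRModel()
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

feats = torch.randn(2, 200, 80)                                 # dummy filterbank batch
targets = torch.randint(1, 100, (2, 30))                        # dummy character targets
log_probs = model(feats)
loss = ctc(log_probs.transpose(0, 1),                           # CTCLoss expects (time, batch, vocab)
           targets,
           torch.full((2,), 200, dtype=torch.long),             # input lengths
           torch.full((2,), 30, dtype=torch.long))              # target lengths
greedy_path = log_probs.argmax(dim=-1)                          # non-autoregressive greedy decode

Collapsing repeated symbols and removing blanks from greedy_path yields the final character sequence, which is why decoding needs only a single forward pass.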
“…These methods mainly focus on easing the conditional independence assumption from a theoretical perspective. The contextualized CTC loss [6] is proposed to guide the model to learn contextualized information by introducing extra prediction heads that predict the surrounding tokens. Some studies aspire to improve CTC-based ASR via knowledge transfer from pre-trained language models [7].…”
Section: Introduction (mentioning)
confidence: 99%
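For concreteness, here is a rough PyTorch sketch of that idea: in addition to the main CTC head, two auxiliary heads predict the neighbouring (previous and next) tokens, and their cross-entropy losses are added to the CTC loss. The head names, the weight alpha, and the assumption that left/right context targets have already been derived (e.g. from the model's greedy CTC path) are hypothetical illustrations, not the implementation from [6].

import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextualizedCTCHeads(nn.Module):
    """Main CTC head plus auxiliary heads for neighbouring tokens (illustrative)."""
    def __init__(self, d_model=256, vocab_size=100):
        super().__init__()
        self.ctc_head = nn.Linear(d_model, vocab_size)     # per-frame CTC logits
        self.left_head = nn.Linear(d_model, vocab_size)     # predicts the previous token
        self.right_head = nn.Linear(d_model, vocab_size)    # predicts the next token

    def forward(self, enc):                                  # enc: (batch, time, d_model)
        return (self.ctc_head(enc).log_softmax(-1),
                self.left_head(enc), self.right_head(enc))

def cctc_loss(log_probs, left_logits, right_logits, targets, in_lens, tgt_lens,
              left_tgt, right_tgt, alpha=0.1):
    """Standard CTC loss plus weighted auxiliary context losses (alpha is assumed)."""
    ctc = F.ctc_loss(log_probs.transpose(0, 1), targets, in_lens, tgt_lens,
                     blank=0, zero_infinity=True)
    aux = (F.cross_entropy(left_logits.flatten(0, 1), left_tgt.flatten()) +
           F.cross_entropy(right_logits.flatten(0, 1), right_tgt.flatten()))
    return ctc + alpha * aux

Supervising each frame with its neighbouring characters is what pushes the per-frame predictions toward spellings that stay consistent within one language, as the abstract above describes.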
“…Sandberg (2022) stated that mastering written English is a complex process that varies for each student. This complexity requires students to map the phoneme-grapheme (sound-letter) relationships of regular (rule-following) and irregular (exception) words by learning letter-sound connections, pronunciations, and word meanings, then retrieving the spelling from memory and perhaps code-switching, i.e., alternating between two or more languages or dialects, depending on their native background (Arfé & Danzak, 2020; Ehri, 2014; Esposito et al., 2022; Naowarat et al., 2021). Failure to meet these requirements will result in spelling insufficiencies.…”
Section: Spelling Complexity (mentioning)
confidence: 99%