2021
DOI: 10.48550/arxiv.2108.01129
Preprint

Decoupling recognition and transcription in Mandarin ASR

Abstract: Much of the recent literature on automatic speech recognition (ASR) is taking an end-to-end approach. Unlike English, where the writing system is closely related to sound, Chinese characters (Hanzi) represent meaning, not sound. We propose factoring audio → Hanzi into two sub-tasks: (1) audio → Pinyin and (2) Pinyin → Hanzi, where Pinyin is a system of phonetic transcription of standard Chinese. Factoring the audio → Hanzi task in this way achieves 3.9% CER (character error rate) on the Aishell-1 corpus, the be…
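To illustrate the two-stage factoring described in the abstract, here is a minimal Python sketch; it is not the paper's implementation. Both stages are hypothetical stubs, and the function names, the toy Pinyin-to-Hanzi table, and the dummy audio input are assumptions made purely for illustration.

# Hypothetical sketch of the factored audio -> Pinyin -> Hanzi pipeline.
# Both stages are stand-ins; the paper's actual models are not reproduced here.

from typing import List


def audio_to_pinyin(audio_frames: List[float]) -> List[str]:
    """Stage 1 (stub): an acoustic model would emit a Pinyin sequence here."""
    # Placeholder hypothesis standing in for a real recognizer's output.
    return ["ni3", "hao3"]


# Toy Pinyin -> Hanzi table; a real second stage would use a sequence model
# to resolve the many homophonous characters that share one Pinyin syllable.
PINYIN_TO_HANZI = {
    ("ni3", "hao3"): "你好",
}


def pinyin_to_hanzi(pinyin: List[str]) -> str:
    """Stage 2 (stub): convert the Pinyin hypothesis into Hanzi."""
    return PINYIN_TO_HANZI.get(tuple(pinyin), " ".join(pinyin))


if __name__ == "__main__":
    hypothesis = audio_to_pinyin(audio_frames=[0.0])  # dummy audio
    print(pinyin_to_hanzi(hypothesis))                # prints: 你好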

Cited by 1 publication (3 citation statements)
References 50 publications (55 reference statements)

“…As shown in Table 1, we compare SCaLa with state-of-the-art ASR systems including hybrid [28], end-to-end [5,29], and self-supervised learning [10,30]. Numerically, SCaLa outperforms the traditional CTC models [20] with 2.84% and 1.38% CER reductions on reading and spontaneous speech data, respectively.…”
Section: Results (mentioning)
confidence: 99%
“…4.3. Experimental results also show that SCaLa significantly outperforms hybrid chain models [28], end-to-end CTC-Conformer systems [29], self-supervised learning systems [10,30], and methods with phoneme masking [5].…”
Section: Results (mentioning)
confidence: 99%