2022
DOI: 10.48550/arxiv.2202.08474
Preprint

Non-Autoregressive ASR with Self-Conditioned Folded Encoders

Abstract: This paper proposes CTC-based non-autoregressive ASR with self-conditioned folded encoders. The proposed method realizes non-autoregressive ASR with fewer parameters by folding the conventional stack of encoders into only two blocks: base encoders and folded encoders. The base encoders convert the input audio features into a neural representation suitable for recognition. The folded encoders are then applied repeatedly for further refinement. Applying the CTC loss to the outputs of all encoders enfo…
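The folding scheme described in the abstract maps to a simple two-stage forward pass: a small base stack runs once, then a single block with one shared set of weights is re-applied several times, with a CTC head attached to every output. Below is a minimal PyTorch sketch of that idea, not the authors' implementation; the block sizes, the shared CTC head, and the posterior feedback projection are assumptions based on the abstract and the related self-conditioned CTC literature.

```python
import torch
import torch.nn as nn

class FoldedEncoderASR(nn.Module):
    """Sketch of CTC with self-conditioned folded encoders.

    A small stack of base encoders runs once; one folded block is
    re-applied n_folds times with the same parameters, and a CTC
    projection is attached to every output. Sizes are illustrative,
    not the paper's exact configuration.
    """

    def __init__(self, d_model=256, nhead=4, vocab=500, n_base=2, n_folds=4):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.base = nn.ModuleList(layer() for _ in range(n_base))
        self.folded = layer()            # one parameter set, reused each pass
        self.n_folds = n_folds
        self.ctc_head = nn.Linear(d_model, vocab)   # shared CTC projection
        self.condition = nn.Linear(vocab, d_model)  # feeds posteriors back in

    def forward(self, x):                # x: (batch, time, d_model) features
        outs = []
        for blk in self.base:
            x = blk(x)
        outs.append(self.ctc_head(x))
        for _ in range(self.n_folds):    # same folded weights every iteration
            # self-conditioning: mix the previous CTC posterior back into x
            x = x + self.condition(outs[-1].softmax(-1))
            x = self.folded(x)
            outs.append(self.ctc_head(x))
        return outs
```

During training, the CTC loss would be computed on each element of `outs` and averaged, which is the "CTC loss on the outputs of all encoders" supervision the abstract describes; at inference only the last output is decoded.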

Cited by 2 publications (2 citation statements)
References 17 publications
“…Despite their promising performance, many works show that transformers are over-parameterized [8,9], which makes the models require substantial memory during training and inference and hence limits their use on-device. To reduce the memory cost, some works share the parameters of one or several transformer blocks so that the model's total parameter count is greatly reduced [9,10,11,12,13]. These models use one or a few transformer blocks to encode features recursively, so they have fewer parameters than the original transformers of the same depth.…”
Section: Introduction
mentioning
confidence: 99%
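The parameter saving this statement refers to is easy to quantify: sharing one block's weights across the depth shrinks the parameter count roughly by the depth factor while keeping the same number of layers at inference. A hypothetical comparison, with sizes chosen only for illustration and not tied to any cited model:

```python
import torch.nn as nn

# Rough parameter count: a 12-layer encoder vs. one layer reused 12 times.
# d_model=512 and 8 heads are illustrative assumptions.
layer = lambda: nn.TransformerEncoderLayer(512, 8, batch_first=True)

stacked = nn.ModuleList(layer() for _ in range(12))  # 12 distinct parameter sets
shared = layer()                                     # 1 set, applied 12 times

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"stacked: {count(stacked)/1e6:.1f}M params")  # ~12x the shared count
print(f"shared : {count(shared)/1e6:.1f}M params")   # same depth at inference
```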
“…
Model | Paper | Code | Framework
… | https://arxiv.org/pdf/2201.10103v2.pdf | - | -
S-CFE CTC [159] | https://arxiv.org/pdf/2202.08474v1.pdf | - | -
CASSNAT [119] | https://arxiv.org/pdf/2010.14725v2.pdf | - | -
DLP [120] | https://arxiv.org/pdf/2010.13270.pdf | - | -
CTC-enhanced [104] | https://arxiv.org/pdf/2010.15025 | - | -
Align-Refine [111] | https://aclanthology.org/2021.naacl-main.154.pdf | https://github.com/amazon-research/align-refine | To be released
Align-Denoise [112] | http://dx.doi.org/10.21437/Interspeech.2021-1906 | https://github.com/bobchennan/espnet/tree | Pytorch/Espnet
LASO-BERT [121] | https://arxiv.org/pdf/2102.07594 | - | -
P2M [160] | https://arxiv.org/pdf/2104.02258 | - | -
Pre-train Comformer [161] | https://arxiv.org/pdf/2104.03416v4.pdf | - | -
WNARS [162] | https://arxiv.org/pdf/2104.03587v2.pdf | - | -
Improved CASS-NAT [163] | https://arxiv.org/pdf/2106.09885v2.pdf | - | -
NAT-UBD [164] | https://arxiv.org/pdf/2109.06684v1.pdf | - | -
Conformer-CIF [165] | https://arxiv.org/pdf/2104.04702 | - | -
NAR-BERT-ASR [103] | https://arxiv.org/pdf/2104.04805v1.pdf | - | -
Conditional-Multispk [166] | https://arxiv.org/pdf/2106.08595v1.pdf | https://github.com/pengchengguo/espnet | Pytorch/Espnet
Streaming NAR [167] | https://arxiv.org/pdf/2107.09428v1.pdf | https://github.com/espnet/espnet | Pytorch/Espnet
A-FMLM [168] | https://arxiv.org/pdf/1911.04908.pdf | - | -
Mask-CTC [115] | https://arxiv.org/pdf/2005.08700.pdf | https://github.com/espnet/espnet | Pytorch/Espnet
KERMIT [169] | https://arxiv.org/pdf/2005.13211.pdf | https://github.com/espnet/espnet | Pytorch/Espnet
LSCO [170] | https://arxiv.org/pdf/2005.04862v4.pdf | - | -
Spike-Triggered [171] | https://arxiv.org/pdf/2005.07903v1.pdf | - | -
Intermediate CTC [116] | https://arxiv.org/pdf/2102.03216v1.pdf | https://github.com/espnet/espnet | Pytorch/Espnet
Self-Conditioned CTC [117] | https://arxiv.org/pdf/2104.02724.pdf | https://github.com/espnet/espnet | Pytorch/Espnet
Text to Speech:
BVAE-TTS [130] | https://openreview.net/pdf?id=o3iritJHLfO | https://github.com/LEEYOONHYUNG/BVAE-TTS | Pytorch
vTTS [172] | https://arxiv.org/pdf/2203.14725.pdf | - | -
Gan-TTS [134] | https://arxiv.org/pdf/2203.01080.pdf | https://github.com/yanggeng1995/GAN-TTS | Pytorch
VARA-TTS [129] | http…
…”
mentioning
confidence: 99%