ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020
DOI: 10.1109/icassp40776.2020.9053889

QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions

Abstract: We propose a new end-to-end neural acoustic model for automatic speech recognition. The model is composed of multiple blocks with residual connections between them. Each block consists of one or more modules with 1D time-channel separable convolutional layers, batch normalization, and ReLU layers. It is trained with CTC loss. The proposed network achieves near state-of-the-art accuracy on LibriSpeech and Wall Street Journal, while having fewer parameters than all competing models. We also demonstrate that this…
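As a rough illustration of why a 1D time-channel separable convolution needs fewer parameters than a regular 1D convolution, the sketch below splits the operation into a depthwise step (one kernel per channel, applied along time) and a pointwise step (a 1x1 mix across channels). It is a minimal pure-Python sketch, not the authors' implementation: the function names are hypothetical, biases, batch normalization, and ReLU are omitted, and the kernel width 33 and channel count 256 in the usage lines are illustrative assumptions rather than values stated in the abstract.

```python
def depthwise_conv1d(x, kernels):
    """Apply one odd-length kernel per channel along the time axis.

    x: list of C channels, each a list of T numbers.
    kernels: list of C kernels. Uses zero 'same' padding and, as in most
    deep learning frameworks, computes a cross-correlation (no kernel flip).
    """
    out = []
    for channel, kernel in zip(x, kernels):
        pad = len(kernel) // 2
        padded = [0.0] * pad + list(channel) + [0.0] * pad
        out.append([
            sum(kernel[k] * padded[t + k] for k in range(len(kernel)))
            for t in range(len(channel))
        ])
    return out


def pointwise_conv1d(x, weights):
    """Mix channels independently at each time step (a 1x1 convolution).

    weights: list of C_out rows, each holding C_in numbers.
    """
    steps = len(x[0])
    return [
        [sum(w * x[c][t] for c, w in enumerate(row)) for t in range(steps)]
        for row in weights
    ]


def separable_params(k, c_in, c_out):
    """Weights in a depthwise (k * c_in) plus pointwise (c_in * c_out) pair."""
    return k * c_in + c_in * c_out


def regular_params(k, c_in, c_out):
    """Weights in a regular 1D convolution with the same shape."""
    return k * c_in * c_out


# Illustrative sizes only (assumed, not taken from the abstract).
print(separable_params(33, 256, 256), regular_params(33, 256, 256))
```

With these assumed sizes the separable pair uses roughly 29 times fewer weights than the regular convolution, which is the kind of saving the abstract's parameter claim relies on.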

Cited by 220 publications (120 citation statements)
References 31 publications
“…We also experimented with other different architectures, such as several LSTM-based models, a combination of 1D-CNN and LSTM, a down-scaled version of the basecaller Bonito [20, 21], and variational window sizes to capture different sequencing speeds. However, the LSTM-based models take a long time to train, and the accuracies don’t improve significantly.…”
Section: Methods (mentioning)
confidence: 99%
“…This could indicate that local features extracted convolution windows provide sufficient information for classification, and long-range dependencies extracted by the recurrent network only help by a small amount. 21], stacks of LSTM layers with variational window size, different hyperparameter tuning, and different training datasets. After full consideration of model size, speed, performance, and training time, we reported the best performing model architecture in the main paper.…”
Section: Model Architecture Experiments and Hyperparameter Tuning (mentioning)
confidence: 99%
“…It was first proposed in [40] and is widely used for 2D image analysis [41]- [44]. Recently, depthwise separable convolution has also been incorporated for processing speech signals [45], [46].…”
Section: A Depthwise Separable Convolutions For 1d Signal (mentioning)
confidence: 99%
“…CTC was developed for speech recognition and first applied to nanopore sequencing by Chiron [46], and was later adopted by various ONT basecallers. Bonito (https://github.com/nanoporetech/bonito) is ONT's most recent research basecaller: it uses a convolutional architecture based on QuartzNet [26], and is trained with CTC loss. In practice, Bonito uses Viterbi decoding, which simply takes the argmax of the logits and concatenates the resulting nucleotide and gap characters.…”
Section: Decoding the Most Likely Output Sequence Of A Neural Network (mentioning)
confidence: 99%
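The argmax-based decoding described in the last citation statement can be sketched as follows. This is the standard greedy (best-path) rule for CTC outputs: take the highest-scoring symbol in each frame, merge consecutive repeats, then drop the blank (gap) symbol. The function name, the alphabet string, and the choice of index 0 for the blank are assumptions made for illustration; the quoted statement does not specify how Bonito lays out its outputs or whether it applies exactly these steps.

```python
from itertools import groupby


def greedy_ctc_decode(logits, alphabet, blank=0):
    """Greedy CTC decoding of per-frame scores.

    logits: list of frames, each a list of scores (one per symbol).
    alphabet: string whose character at index i is symbol i.
    blank: index of the CTC blank symbol (assumed to be 0 here).
    """
    # 1. Pick the highest-scoring symbol index in every frame.
    best_path = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    # 2. Collapse runs of identical consecutive indices into one.
    collapsed = [idx for idx, _ in groupby(best_path)]
    # 3. Remove blanks and map the remaining indices to characters.
    return "".join(alphabet[idx] for idx in collapsed if idx != blank)


# Hypothetical scores over the alphabet "-ACGT" ("-" is the blank).
frames = [
    [0.1, 0.9, 0.0, 0.0, 0.0],  # A
    [0.1, 0.8, 0.1, 0.0, 0.0],  # A (repeat, merged with the previous frame)
    [0.9, 0.1, 0.0, 0.0, 0.0],  # blank
    [0.0, 0.7, 0.3, 0.0, 0.0],  # A (a new A, since a blank separates it)
    [0.0, 0.2, 0.8, 0.0, 0.0],  # C
    [0.0, 0.1, 0.9, 0.0, 0.0],  # C (repeat)
]
print(greedy_ctc_decode(frames, "-ACGT"))
```

The blank frame is what lets the decoder emit two adjacent copies of the same symbol; without it, the repeated frames would merge into one.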