Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
DOI: 10.18653/v1/d18-2012

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

Abstract: This paper describes SentencePiece, a language-independent subword tokenizer and detokenizer designed for Neural-based text processing, including Neural Machine Translation. It provides open-source C++ and Python implementations for subword units. While existing subword segmentation tools assume that the input is pre-tokenized into word sequences, SentencePiece can train subword models directly from raw sentences, which allows us to make a purely end-to-end and language independent system. We perform a validation experiment of NMT on English-Japanese machine translation, and find that it is possible to achieve comparable accuracy to direct subword training from raw sentences. We also compare the performance of subword training and segmentation with various configurations. SentencePiece is available under the Apache 2 license at https://github.com/google/sentencepiece.
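As a concrete illustration of the end-to-end workflow the abstract describes, here is a minimal sketch using the official sentencepiece Python package; the corpus file name, model prefix, and vocabulary size are placeholders, not values from the paper.

```python
# Minimal sketch: train a SentencePiece model directly from raw,
# untokenized text. File names and vocab size are illustrative.
import sentencepiece as spm

# Train on raw sentences; no pre-tokenization step is required.
spm.SentencePieceTrainer.train(
    input="raw_corpus.txt",   # one sentence per line, any language
    model_prefix="spm_demo",  # writes spm_demo.model / spm_demo.vocab
    vocab_size=8000,
)

sp = spm.SentencePieceProcessor(model_file="spm_demo.model")
print(sp.encode("Hello world.", out_type=str))
# e.g. ['▁Hello', '▁world', '.'] — actual pieces depend on the corpus
```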

Cited by 2,079 publications (1,445 citation statements). References 18 publications.
“…Table 4 shows the WER results of these experiments together with a brief summary of the best results from the literature. These include hybrid HMM systems as well as end-to-end (E2E) systems using different model types, topologies and label units, such as byte pair encoding (BPE) and SentencePiece [30]. We refer readers to the original papers for more details.…”
Section: Results
confidence: 99%
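The excerpt above contrasts BPE and SentencePiece label units; the SentencePiece trainer itself can produce either segmentation through its model_type parameter. A minimal sketch, with illustrative file names and vocabulary size:

```python
# SentencePiece can train either a BPE or a unigram-LM segmentation;
# `model_type` selects the algorithm. File names are illustrative.
import sentencepiece as spm

for model_type in ("bpe", "unigram"):
    spm.SentencePieceTrainer.train(
        input="train_text.txt",
        model_prefix=f"labels_{model_type}",
        vocab_size=5000,
        model_type=model_type,
    )
```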
“…The default is spaCy. A SentencePiece tokenizer [15] is also provided by the library. Subword tokenization [16][17], such as that provided by SentencePiece, has been used in many recent NLP breakthroughs [18][19].…”
Section: Text
confidence: 99%
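Because SentencePiece treats whitespace as an ordinary symbol (the '▁' marker), encoding and decoding form a lossless round trip, which is what lets libraries expose it without language-specific detokenization rules. A small sketch, assuming the spm_demo.model file trained in the earlier example:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spm_demo.model")

text = "Subword tokenization needs no language-specific rules."
pieces = sp.encode(text, out_type=str)  # subword pieces, e.g. ['▁Sub', 'word', ...]
ids = sp.encode(text, out_type=int)     # the same segmentation as ids

# Detokenization is lossless: whitespace is preserved via the '▁' marker.
assert sp.decode(pieces) == text
assert sp.decode(ids) == text
```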
“…The decoder part uses 4 1-D convolutional layers with kernel size = 3 and 256 output features. Supervised labels and contextual text are encoded into a 5k sub-word output vocabulary [21]. We use the AdaDelta algorithm [30] with a fixed learning rate of 1.0 and gradient clipping at 10.0, where total gradients are scaled by the number of utterances in each minibatch.…”
Section: Methods
confidence: 99%
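For orientation, a hedged PyTorch sketch of the decoder and optimizer settings quoted above; the input feature size, activation, padding, and loss handling are assumptions, not details from the cited paper:

```python
# Sketch of the quoted decoder settings: four 1-D convolutions with
# kernel size 3 and 256 output features, AdaDelta with lr=1.0, and
# gradient clipping at 10.0. Input size and ReLU are assumptions.
import torch
import torch.nn as nn

in_features = 256  # assumed; not stated in the excerpt
layers = []
for i in range(4):
    layers += [
        nn.Conv1d(in_features if i == 0 else 256, 256,
                  kernel_size=3, padding=1),
        nn.ReLU(),
    ]
decoder = nn.Sequential(*layers)

optimizer = torch.optim.Adadelta(decoder.parameters(), lr=1.0)

def training_step(batch, targets, loss_fn, num_utterances):
    optimizer.zero_grad()
    loss = loss_fn(decoder(batch), targets)
    # Scale gradients by the number of utterances in the minibatch,
    # as described in the excerpt, then clip at 10.0.
    (loss / num_utterances).backward()
    torch.nn.utils.clip_grad_norm_(decoder.parameters(), 10.0)
    optimizer.step()
```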
“…(ii) {X, Y_w} ∈ D_w is the weakly-supervised dataset where X and Y_w are pairs of audio features and the corresponding contextual text. The targets Y_s and Y_w are sequences of sub-word units [21].…”
Section: Weakly Supervised Training, 2.1 Datasets
confidence: 99%
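To make the data layout concrete, a hypothetical sketch of mapping the text side of such pairs to sub-word target sequences with a trained SentencePiece model; all file and variable names are invented for illustration:

```python
# Illustrative only: turn the text side of (audio, text) pairs into
# sub-word id target sequences. The model file and data are made up.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="subword5k.model")

supervised = [("utt1.feats", "turn left at the light")]          # D_s
weakly_supervised = [("utt2.feats", "navigate to main street")]  # D_w

def to_targets(pairs):
    # Y = sequence of sub-word ids for each transcript / contextual text.
    return [(x, sp.encode(text, out_type=int)) for x, text in pairs]

D_s = to_targets(supervised)          # targets Y_s
D_w = to_targets(weakly_supervised)   # targets Y_w
```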