Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
DOI: 10.18653/v1/d18-2012

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

Abstract: This paper describes SentencePiece, a language-independent subword tokenizer and detokenizer designed for Neural-based text processing, including Neural Machine Translation. It provides open-source C++ and Python implementations for subword units. While existing subword segmentation tools assume that the input is pre-tokenized into word sequences, SentencePiece can train subword models directly from raw sentences, which allows us to make a purely end-to-end and language independent system. We perform a validation experiment of NMT on English-Japanese machine translation, and find that it is possible to achieve comparable accuracy to direct subword training from raw sentences. We also compare the performance of subword training and segmentation with various configurations. SentencePiece is available under the Apache 2 license at https://github.com/google/sentencepiece.
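As a concrete illustration of the end-to-end workflow the abstract describes, here is a minimal sketch using the official sentencepiece Python package; the corpus file name, model prefix, and vocabulary size are placeholders, not values from the paper.

```python
# Minimal sketch: train a SentencePiece model directly from raw,
# untokenized text. File names and vocab size are illustrative.
import sentencepiece as spm

# Train on raw sentences; no pre-tokenization step is required.
spm.SentencePieceTrainer.train(
    input="raw_corpus.txt",   # one sentence per line, any language
    model_prefix="spm_demo",  # writes spm_demo.model / spm_demo.vocab
    vocab_size=8000,
)

sp = spm.SentencePieceProcessor(model_file="spm_demo.model")
print(sp.encode("Hello world.", out_type=str))
# e.g. ['▁Hello', '▁world', '.'] — actual pieces depend on the corpus
```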

Cited by 2,079 publications (1,445 citation statements). References 18 publications.
“…Table 4 shows the WER results of these experiments together with a brief summary of the best results from the literature. These include hybrid HMM systems as well as end-to-end (E2E) systems using different model types, topologies and label units, such as byte pair encoding (BPE) and SentencePiece [30]. We refer readers to the original papers for more details.…”
Section: Results
confidence: 99%
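The excerpt above contrasts BPE and SentencePiece label units; the SentencePiece trainer itself can produce either segmentation through its model_type parameter. A minimal sketch, with illustrative file names and vocabulary size:

```python
# SentencePiece can train either a BPE or a unigram-LM segmentation;
# `model_type` selects the algorithm. File names are illustrative.
import sentencepiece as spm

for model_type in ("bpe", "unigram"):
    spm.SentencePieceTrainer.train(
        input="train_text.txt",
        model_prefix=f"labels_{model_type}",
        vocab_size=5000,
        model_type=model_type,
    )
```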
“…The default is spaCy. A SentencePiece tokenizer [15] is also provided by the library. Subword tokenization [16][17], such as that provided by SentencePiece, has been used in many recent NLP breakthroughs [18][19].…”
Section: Text
confidence: 99%
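Because SentencePiece treats whitespace as an ordinary symbol (the '▁' marker), encoding and decoding form a lossless round trip, which is what lets libraries expose it without language-specific detokenization rules. A small sketch, assuming the spm_demo.model file trained in the earlier example:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spm_demo.model")

text = "Subword tokenization needs no language-specific rules."
pieces = sp.encode(text, out_type=str)  # subword pieces, e.g. ['▁Sub', 'word', ...]
ids = sp.encode(text, out_type=int)     # the same segmentation as ids

# Detokenization is lossless: whitespace is preserved via the '▁' marker.
assert sp.decode(pieces) == text
assert sp.decode(ids) == text
```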
“…The decoder part uses 4 1-D convolutional layers with kernel size = 3 and 256 output features. Supervised labels and contextual text are encoded into a 5k sub-word output vocabulary [21]. We use the AdaDelta algorithm [30] with a fixed learning rate of 1.0 and gradient clipping at 10.0, where total gradients are scaled by the number of utterances in each minibatch.…”
Section: Methods
confidence: 99%
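For orientation, a hedged PyTorch sketch of the decoder and optimizer settings quoted above; the input feature size, activation, padding, and loss handling are assumptions, not details from the cited paper:

```python
# Sketch of the quoted decoder settings: four 1-D convolutions with
# kernel size 3 and 256 output features, AdaDelta with lr=1.0, and
# gradient clipping at 10.0. Input size and ReLU are assumptions.
import torch
import torch.nn as nn

in_features = 256  # assumed; not stated in the excerpt
layers = []
for i in range(4):
    layers += [
        nn.Conv1d(in_features if i == 0 else 256, 256,
                  kernel_size=3, padding=1),
        nn.ReLU(),
    ]
decoder = nn.Sequential(*layers)

optimizer = torch.optim.Adadelta(decoder.parameters(), lr=1.0)

def training_step(batch, targets, loss_fn, num_utterances):
    optimizer.zero_grad()
    loss = loss_fn(decoder(batch), targets)
    # Scale gradients by the number of utterances in the minibatch,
    # as described in the excerpt, then clip at 10.0.
    (loss / num_utterances).backward()
    torch.nn.utils.clip_grad_norm_(decoder.parameters(), 10.0)
    optimizer.step()
```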
“…(ii) {X, Y_w} ∈ D_w is the weakly-supervised dataset where X and Y_w are pairs of audio features and the corresponding contextual text. The targets Y_s and Y_w are sequences of sub-word units [21].…”
Section: Weakly Supervised Training, 2.1 Datasets
confidence: 99%
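To make the data layout concrete, a hypothetical sketch of mapping the text side of such pairs to sub-word target sequences with a trained SentencePiece model; all file and variable names are invented for illustration:

```python
# Illustrative only: turn the text side of (audio, text) pairs into
# sub-word id target sequences. The model file and data are made up.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="subword5k.model")

supervised = [("utt1.feats", "turn left at the light")]          # D_s
weakly_supervised = [("utt2.feats", "navigate to main street")]  # D_w

def to_targets(pairs):
    # Y = sequence of sub-word ids for each transcript / contextual text.
    return [(x, sp.encode(text, out_type=int)) for x, text in pairs]

D_s = to_targets(supervised)          # targets Y_s
D_w = to_targets(weakly_supervised)   # targets Y_w
```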