Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.120
Optimizing Word Segmentation for Downstream Task

Abstract: In traditional NLP, we tokenize a given sentence as a preprocessing step, and thus the tokenization is unrelated to the target downstream task. To address this issue, we propose a novel method to explore a tokenization that is appropriate for the downstream task. Our proposed method, optimizing tokenization (OpTok), is trained to assign a high probability to such an appropriate tokenization based on the downstream task loss. OpTok can be used for any downstream task that uses a vector representation of a sentence suc…
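The mechanism sketched in the abstract — weighting candidate tokenizations so that the downstream loss can prefer some segmentations over others — can be illustrated with a small, self-contained sketch. This is not the authors' implementation; all names (`weighted_sentence_vector`, the toy mean-pooling encoder, the unigram log-probability table) are illustrative assumptions. The key point is that the downstream input is a softmax-weighted sum over N-best tokenizations, so a downstream loss could back-propagate into the tokenization scores.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def sentence_vector(tokens, embeddings):
    # Toy sentence encoder: mean of the token embeddings.
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    for t in tokens:
        for i, v in enumerate(embeddings[t]):
            vec[i] += v
    return [v / len(tokens) for v in vec]

def weighted_sentence_vector(nbest, unigram_logp, embeddings):
    # Score each candidate tokenization by the sum of its unigram
    # log-probabilities, then combine the candidate sentence vectors
    # weighted by the softmax of those scores. In a trainable setup the
    # scores would be parameters updated by the downstream task loss.
    scores = [sum(unigram_logp[t] for t in toks) for toks in nbest]
    weights = softmax(scores)
    dim = len(next(iter(embeddings.values())))
    out = [0.0] * dim
    for w, toks in zip(weights, nbest):
        sv = sentence_vector(toks, embeddings)
        for i in range(dim):
            out[i] += w * sv[i]
    return out, weights
```

For example, with two candidate segmentations of the string "abc" (hypothetical values throughout), the higher-scoring candidate receives the larger weight, and the downstream model sees the blended vector rather than a single hard segmentation.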

Cited by 11 publications (16 citation statements). References 34 publications.
“We focus on SentencePiece (SP) (Kudo and Richardson, 2018) and OpTok (Hiraoka et al., 2020) as other tokenizers for comparison with the proposed method. OpTok is a method to optimize tokenization for text classification by weighting a sentence vector with N-best tokenizations.”
Section: Text Classification
Confidence: 99%
“Much prior research has reported that an appropriate tokenization depends on the downstream task (Xu et al., 2008; Chang et al., 2008; Nguyen et al., 2010; Domingo et al., 2018; Hiraoka et al., 2019; Gowda and May, 2020). Moreover, Hiraoka et al. (2020) imply that we must consider the downstream model to determine an appropriate tokenization. In other words, we can improve the performance of a downstream model by choosing a tokenization appropriate for that model.”
Section: Introduction
Confidence: 99%