Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.120
Optimizing Word Segmentation for Downstream Task

Abstract: In traditional NLP, we tokenize a given sentence as a preprocessing step, and thus the tokenization is unrelated to the target downstream task. To address this issue, we propose a novel method to explore a tokenization that is appropriate for the downstream task. Our proposed method, optimizing tokenization (OpTok), is trained to assign a high probability to such an appropriate tokenization based on the downstream task loss. OpTok can be used for any downstream task that uses a vector representation of a sentence suc…
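The mechanism sketched in the abstract — weighting candidate tokenizations so that the downstream loss can prefer some segmentations over others — can be illustrated with a small, self-contained sketch. This is not the authors' implementation; all names (`weighted_sentence_vector`, the toy mean-pooling encoder, the unigram log-probability table) are illustrative assumptions. The key point is that the downstream input is a softmax-weighted sum over N-best tokenizations, so a downstream loss could back-propagate into the tokenization scores.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def sentence_vector(tokens, embeddings):
    # Toy sentence encoder: mean of the token embeddings.
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    for t in tokens:
        for i, v in enumerate(embeddings[t]):
            vec[i] += v
    return [v / len(tokens) for v in vec]

def weighted_sentence_vector(nbest, unigram_logp, embeddings):
    # Score each candidate tokenization by the sum of its unigram
    # log-probabilities, then combine the candidate sentence vectors
    # weighted by the softmax of those scores. In a trainable setup the
    # scores would be parameters updated by the downstream task loss.
    scores = [sum(unigram_logp[t] for t in toks) for toks in nbest]
    weights = softmax(scores)
    dim = len(next(iter(embeddings.values())))
    out = [0.0] * dim
    for w, toks in zip(weights, nbest):
        sv = sentence_vector(toks, embeddings)
        for i in range(dim):
            out[i] += w * sv[i]
    return out, weights
```

For example, with two candidate segmentations of the string "abc" (hypothetical values throughout), the higher-scoring candidate receives the larger weight, and the downstream model sees the blended vector rather than a single hard segmentation.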

Cited by 11 publications (16 citation statements). References 34 publications.
“We focus on SentencePiece (SP) (Kudo and Richardson, 2018) and OpTok (Hiraoka et al., 2020) as other tokenizers for comparison with the proposed method. OpTok is a method to optimize tokenization for text classification by weighting a sentence vector with N-best tokenizations.”
Section: Text Classification
Confidence: 99%
“Much prior research has reported that an appropriate tokenization depends on the downstream task (Xu et al., 2008; Chang et al., 2008; Nguyen et al., 2010; Domingo et al., 2018; Hiraoka et al., 2019; Gowda and May, 2020). Moreover, Hiraoka et al. (2020) imply that we must consider the downstream model to determine an appropriate tokenization. In other words, we can improve the performance of a downstream model by choosing a tokenization appropriate for that model.”
Section: Introduction
Confidence: 99%