In traditional NLP, tokenization is performed as a preprocessing step, and is therefore independent of the downstream task. To address this issue, we propose a novel method that explores an appropriate tokenization for the downstream task. Our proposed method, Optimizing Tokenization (OpTok), is trained to assign high probability to such appropriate tokenizations based on the downstream task loss. OpTok can be applied to any downstream task that uses a sentence vector representation, such as text classification. Experimental results demonstrate that OpTok improves the performance of sentiment analysis, genre prediction, rating prediction, and textual entailment. The results also show that the proposed method is applicable to Chinese, Japanese, and English. In addition, we incorporate OpTok into BERT, a state-of-the-art contextualized embedding model, and report a positive effect on performance.
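To illustrate the general idea of weighting candidate tokenizations by the downstream loss, the following is a minimal sketch, not the authors' exact formulation: it scores the N-best tokenizations of a sentence, mixes their sentence vectors with the softmaxed scores, and lets the classification loss update the scorer end-to-end. All names here (WeightedTokenizationClassifier, encode_candidate, the mean-pooling encoder, the linear scorer) are hypothetical stand-ins, not components of OpTok.

    # Minimal sketch: weight N-best tokenizations by a learned score and train
    # the scorer jointly with the downstream classifier (hypothetical code).
    import torch
    import torch.nn as nn

    class WeightedTokenizationClassifier(nn.Module):
        def __init__(self, vocab_size, emb_dim, num_classes):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)     # subword embeddings
            self.score = nn.Linear(emb_dim, 1)                 # scores one candidate tokenization
            self.classifier = nn.Linear(emb_dim, num_classes)  # downstream classifier

        def encode_candidate(self, token_ids):
            # Mean-pooled bag-of-subwords vector for one candidate tokenization.
            return self.embed(token_ids).mean(dim=0)

        def forward(self, candidates):
            # candidates: list of LongTensors, the N-best tokenizations of one sentence.
            vecs = torch.stack([self.encode_candidate(c) for c in candidates])   # (N, emb_dim)
            weights = torch.softmax(self.score(vecs).squeeze(-1), dim=0)         # (N,)
            sentence_vec = (weights.unsqueeze(-1) * vecs).sum(dim=0)             # weighted mixture
            return self.classifier(sentence_vec), weights

    # Usage: the task loss flows back into the scorer, so tokenizations that help
    # the downstream task receive higher weight as training proceeds.
    model = WeightedTokenizationClassifier(vocab_size=8000, emb_dim=64, num_classes=2)
    candidates = [torch.tensor([3, 41, 7]), torch.tensor([3, 512])]  # two candidate segmentations
    logits, weights = model(candidates)
    loss = nn.functional.cross_entropy(logits.unsqueeze(0), torch.tensor([1]))
    loss.backward()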