Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d17-1072

Multi-Grained Chinese Word Segmentation

Abstract: Traditionally, word segmentation (WS) adopts the single-granularity formalism, where a sentence corresponds to a single word sequence. However, Sproat et al. (1996) show that the inter-native-speaker consistency ratio over Chinese word boundaries is only 76%, indicating that single-grained WS (SWS) imposes unnecessary challenges on both manual annotation and statistical modeling. Moreover, WS results of different granularities can be complementary and beneficial for high-level applications. This work proposes and add…

Cited by 11 publications (31 citation statements)
References 15 publications
“…Given an input sentence, the task of MWS is to retrieve all words of different granularities, which can be naturally organized as a hierarchical tree structure as shown in Figure 1 (right). Gong et al. (2017) propose several MWS approaches and show that treating MWS as constituent parsing leads to the best performance. They adopt the transition-based parser of Cross and Huang (2016), which greedily searches for an optimal shift-reduce action sequence to build a tree.…”
Section: Graph-based Model With Local Loss
confidence: 99%
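The greedy shift-reduce procedure described in the excerpt can be sketched as below. This is a hypothetical illustration of the control flow only: `score_action` and the toy scorer stand in for the learned action classifier of Cross and Huang (2016), not their actual model.

```python
# Minimal sketch of greedy shift-reduce tree building: at each step a scorer
# picks the highest-scoring legal action (SHIFT moves a token from the buffer
# onto the stack; REDUCE combines the top two stack items into one subtree).
def greedy_shift_reduce(tokens, score_action):
    stack, buffer = [], list(tokens)
    while buffer or len(stack) > 1:
        legal = []
        if buffer:
            legal.append("SHIFT")
        if len(stack) >= 2:
            legal.append("REDUCE")
        # Greedy: take the single best action now, no beam, no backtracking.
        action = max(legal, key=lambda a: score_action(stack, buffer, a))
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        else:
            right, left = stack.pop(), stack.pop()
            stack.append((left, right))  # merge two subtrees into one node
    return stack[0]

# Toy scorer that always prefers SHIFT while the buffer is non-empty,
# which yields a right-branching tree.
tree = greedy_shift_reduce(["迈向", "充满", "希望"],
                           lambda s, b, a: 1.0 if a == "SHIFT" else 0.0)
# tree == ("迈向", ("充满", "希望"))
```

Because each action is chosen greedily, one early mistake propagates through the rest of the tree, which is part of the motivation the next excerpt gives for switching to a graph-based parser.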
“…They adopt the transition-based parser of Cross and Huang (2016), which greedily searches for an optimal shift-reduce action sequence to build a tree. In this work, instead of adopting the transition-based parser as Gong et al. (2017) do, we employ the graph-based parser of Stern et al. (2017) and replace the original global max-margin loss with a local span-wise loss (Joshi et al., 2018; Teng and Zhang, 2018) as our basic MWS model, due to two considerations: 1) the graph-based parser with local loss gains efficiency without hurting performance compared with the transition-based parser and the graph-based parser with global loss, which will be discussed in Section 5.3; 2) more importantly, this work aims to conduct an in-depth study of a simple, efficient, and effective way to incorporate weakly labeled data for MWS.…”
[Figure 2: Architecture of our MWS model.]
Section: Graph-based Model With Local Loss
confidence: 99%
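The local span-wise loss mentioned in the excerpt can be sketched as follows. This is a simplified illustration, assuming sigmoid span probabilities and a plain dict of scores in place of the neural span encoder; it shows only the "local" idea (an independent binary cross-entropy per candidate span against the gold tree), not the cited papers' exact formulation.

```python
import math

# Each candidate span (i, j) is scored independently; the loss is the mean
# binary cross-entropy between the span's predicted probability and whether
# the span appears in the gold tree. No tree-level margin is involved, which
# is what makes the objective "local".
def local_span_loss(span_scores, gold_spans):
    loss = 0.0
    for span, score in span_scores.items():
        p = 1.0 / (1.0 + math.exp(-score))          # sigmoid probability
        y = 1.0 if span in gold_spans else 0.0       # gold label for the span
        loss -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return loss / len(span_scores)

# Example: one gold span scored positively, one non-gold span scored negatively.
loss = local_span_loss({(0, 2): 2.0, (0, 1): -2.0}, {(0, 2)})
```

Training then reduces to per-span classification, which is one way to see why the excerpt reports efficiency gains over a global max-margin objective that must decode a full tree for each update.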
“…The change in F1-score as more source domains are introduced, in three different orders: Max-, Min-, and Rand-select. The red dotted line is the result reported by Chen et al. (2017) with the same model, trained on nine datasets. 1…”
confidence: 97%