An Encoding Strategy Based Word-Character

Liu, Wei; Tongge, Xu; Xu, Qinghua; Song, Jiayu; Zu, Yueran

doi:10.18653/v1/n19-1247

Cited by 83 publications

(58 citation statements)

References 30 publications

“…Overall, in both R and P settings, ZEN outperforms BERT in all seven tasks, which clearly indicates the advantage of introducing n-grams into the encoding of character sequences. 13 This observation is similar to that from Dos Santos and Gatti (2014); Lample et al (2016); Bojanowski et al (2017); Liu et al (2019a). In detail, when compare R and P settings, 12 Most of the previous studies show their performance on the development set of the aforementioned tasks and we follow them to do so in order to provide a reference and comparison.…”

Section: Overall Performance (supporting)

confidence: 66%

ZEN: Pre-training Chinese Text Encoder Enhanced by N-gram Representations

Diao, Bai, Yan et al. 2020

Findings of the Association for Computational Linguistics: EMNLP 2020

The pre-training of text encoders normally processes text as a sequence of tokens corresponding to small text units, such as word pieces in English and characters in Chinese. It omits information carried by larger text granularity, and thus the encoders cannot easily adapt to certain combinations of characters. This leads to a loss of important semantic information, which is especially problematic for Chinese because the language does not have explicit word boundaries. In this paper, we propose ZEN, a BERT-based Chinese (Z) text encoder Enhanced by N-gram representations, where different combinations of characters are considered during training, so that potential word or phrase boundaries are explicitly pre-trained and fine-tuned with the character encoder (BERT). ZEN therefore incorporates the comprehensive information of both the character sequence and the words or phrases it contains. Experimental results illustrate the effectiveness of ZEN on a series of Chinese NLP tasks, where state-of-the-art results are achieved on most tasks while requiring fewer resources than other published encoders. It is also shown that reasonable performance is obtained when ZEN is trained on a small corpus, which is important for applying pre-training techniques to scenarios with limited data.

* Work done during the internship at Sinovation Ventures.

“…A drawback of the purely character-based NER model is that the word information is not fully exploited. To incorporate word information in Chinese NER, some recent methods, such as [10,11,12,13,14], resort to an automatically constructed lexicon.…”

Section: Chinese NER (mentioning)

confidence: 99%

Overview of CCKS 2020 Task 3: Named Entity Recognition and Event Extraction in Chinese Electronic Medical Records

Wen et al. 2021

Data Intelligence

The China Conference on Knowledge Graph and Semantic Computing (CCKS) 2020 Evaluation Task 3 presented clinical named entity recognition and event extraction for Chinese electronic medical records. Two annotated data sets and some additional resources for these two subtasks were provided to participants. The competition attracted 354 teams, 46 of which submitted valid results. Pre-trained language models were widely applied in this evaluation task. Data augmentation and external resources were also helpful.

“…Chinese word segmentation was performed first before applying character sequence labeling (Guo et al, 2004; Mao et al, 2008; Zhu and Wang, 2019). The pre-processing segmentation features included character positional embedding (Peng and Dredze, 2015; He and Sun, 2017a,b), segmentation tags (Zhu and Wang, 2019), word embedding (Peng and Dredze, 2015; Liu et al, 2019; E and Xiang, 2017) and so on. The other was to train NER and CWS tasks jointly to incorporate task-shared word boundary information from the CWS into the NER (Xu et al, 2013; Peng and Dredze, 2016; Cao et al, 2018).…”

Section: Related Work (mentioning)

confidence: 99%

“…However, they treated the segmentations equally without error discrimination. Liu et al (2019) introduced four naive selection strategies to select words from the pre-prepared Lexicon for their model. However, these strategies did not consider the context of a sentence.…”

Section: Related Work (mentioning)

confidence: 99%

“…proposed a Lattice LSTM model that used the gated recurrent units to control the contribution of the potential words. However, as shown by Liu et al (2019), the gate mechanism might cause the model to degenerate into a partial word-based model. Ding et al (2019) and Gui et al (2019) proposed the models with graph neural network based on the information that the gazetteers or lexicons offered.…”

Section: Related Work (mentioning)

confidence: 99%

Incorporating Uncertain Segmentation Information into Chinese NER for Social Media Text

Jia, Ding, Chen et al. 2020

Proceedings of the Eighth International Workshop on Natural Language Processing for Social Media

Chinese word segmentation is necessary to provide word-level information for Chinese named entity recognition (NER) systems. However, segmentation error propagation is a challenge for Chinese NER when processing colloquial data such as social media text. In this paper, we propose a model (UIcwsNN) that specializes in identifying entities in Chinese social media text, especially by leveraging uncertain word segmentation information. Such ambiguous information contains all the potential segmentation states of a sentence and provides a channel for the model to infer deep word-level characteristics. We propose a trilogy (i.e., Candidate Position Embedding ⇒ Position Selective Attention ⇒ Adaptive Word Convolution) to encode uncertain word segmentation information and acquire appropriate word-level representation. Experimental results on the social media corpus show that our model effectively alleviates the cascading of segmentation errors and achieves a significant performance improvement of 2% over previous state-of-the-art methods.
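As a rough illustration of what "all the potential segmentation states of a sentence" might look like in practice, the sketch below enumerates every lexicon word that matches a span of a character sequence and records, for each character, the set of positions (B/M/E/S) it could occupy. The lexicon, the function names, the `max_len` limit and the B/M/E/S tag set are illustrative assumptions; this is not the abstract's Candidate Position Embedding or any cited paper's implementation.

```python
from typing import List, Set, Tuple


def match_lexicon_words(
    chars: str, lexicon: Set[str], max_len: int = 4
) -> List[Tuple[int, int, str]]:
    """Return every (start, end_exclusive, word) span of `chars` found in `lexicon`."""
    spans = []
    for start in range(len(chars)):
        # Try every substring of up to `max_len` characters beginning at `start`.
        for end in range(start + 1, min(len(chars), start + max_len) + 1):
            word = chars[start:end]
            if word in lexicon:
                spans.append((start, end, word))
    return spans


def candidate_position_tags(
    chars: str, lexicon: Set[str], max_len: int = 4
) -> List[Set[str]]:
    """For each character, collect the positions it may take in a matched word:
    B (begin), M (middle), E (end), or S (single-character word)."""
    tags: List[Set[str]] = [set() for _ in chars]
    for start, end, _ in match_lexicon_words(chars, lexicon, max_len):
        if end - start == 1:
            tags[start].add("S")
        else:
            tags[start].add("B")
            tags[end - 1].add("E")
            for i in range(start + 1, end - 1):
                tags[i].add("M")
    return tags


# Hypothetical toy lexicon and an ambiguous sentence, for illustration only.
toy_lexicon = {"南京", "南京市", "市长", "长江", "大桥", "长江大桥", "江"}
sentence = "南京市长江大桥"
spans = match_lexicon_words(sentence, toy_lexicon)
tags = candidate_position_tags(sentence, toy_lexicon)
```

Here the character 市 ends up tagged both B (as in 市长) and E (as in 南京市), which is the kind of ambiguity a purely segmentation-first pipeline would have to resolve before NER.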
