In this paper, the authors address the significance and complexity of tokenization, the initial step of NLP. The notions of word and token are discussed and defined from the viewpoints of lexicography and pragmatic implementation, respectively. Automatic segmentation of Chinese words is presented as an illustration of tokenization. Practical approaches to the identification of compound tokens in English, such as idioms, phrasal verbs, and fixed expressions, are developed.
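The abstract does not spell out the identification procedure; as a minimal sketch, compound tokens can be merged by greedy longest-match against a lexicon of multi-word units. The lexicon entries and helper names below are illustrative assumptions, not taken from the paper:

```python
# Minimal sketch: greedy longest-match identification of compound tokens.
# The lexicon entries are illustrative, not from the paper.
FIXED_EXPRESSIONS = {
    ("kick", "the", "bucket"),   # idiom
    ("give", "up"),              # phrasal verb
    ("in", "spite", "of"),       # fixed expression
}
MAX_LEN = max(len(e) for e in FIXED_EXPRESSIONS)

def tokenize(words):
    """Merge multi-word units into single compound tokens, longest match first."""
    tokens, i = [], 0
    while i < len(words):
        for n in range(min(MAX_LEN, len(words) - i), 1, -1):
            if tuple(words[i:i + n]) in FIXED_EXPRESSIONS:
                tokens.append("_".join(words[i:i + n]))
                i += n
                break
        else:  # no multi-word match starting at position i
            tokens.append(words[i])
            i += 1
    return tokens

print(tokenize("he decided to give up in spite of the risk".split()))
# ['he', 'decided', 'to', 'give_up', 'in_spite_of', 'the', 'risk']
```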
This paper describes our system for the joint parsing of syntactic and semantic dependencies, built for our participation in the CoNLL-2008 shared task. We show that both syntactic parsing and semantic parsing can be cast as a word-pair classification problem and implemented as a single-stage system with the aid of maximum entropy modeling. Our system ranked fourth in the closed track, with the following performance on the WSJ+Brown test set: 81.44% labeled macro F1 for the overall task, 86.66% labeled attachment score for syntactic dependencies, and 76.16% labeled F1 for semantic dependencies.
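To make the word-pair formulation concrete, here is a minimal sketch using scikit-learn's LogisticRegression as a stand-in for the paper's maximum entropy model; the features and training data are toy assumptions, not the authors' feature set:

```python
# Minimal sketch of word-pair classification for dependency parsing.
# Features and data are toy stand-ins; LogisticRegression plays the
# role of the maximum entropy classifier.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def pair_features(sent, h, d):
    """Features for a candidate (head, dependent) word pair."""
    return {
        "head_word": sent[h], "dep_word": sent[d],
        "distance": str(abs(h - d)), "direction": "L" if h < d else "R",
    }

# Toy training data: (sentence, head index, dependent index, label).
train = [
    (["she", "reads", "books"], 1, 0, "SBJ"),
    (["she", "reads", "books"], 1, 2, "OBJ"),
    (["he", "writes", "code"], 1, 0, "SBJ"),
    (["he", "writes", "code"], 1, 2, "OBJ"),
]
vec = DictVectorizer()
X = vec.fit_transform([pair_features(s, h, d) for s, h, d, _ in train])
y = [lab for *_, lab in train]
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Classify every candidate word pair in a new sentence.
sent = ["they", "read", "papers"]
for d in range(len(sent)):
    for h in range(len(sent)):
        if h != d:
            label = clf.predict(vec.transform([pair_features(sent, h, d)]))[0]
            print(sent[h], "->", sent[d], label)
```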
Measuring mono-word termhood by rank difference via corpus comparison (Chunyu Kit and Xiaoyue Liu). Terminology, as a set of concept carriers, crystallizes our specialized knowledge about a subject. Automatic term recognition (ATR) plays a critical role in the processing and management of various kinds of information, knowledge, and documents, e.g., knowledge acquisition via text mining. Measuring termhood properly is one of the core issues in ATR. This article presents a novel approach to termhood measurement for mono-word terms via corpus comparison, which quantifies the termhood of a term candidate as its rank difference between a domain corpus and a background corpus. Our ATR experiments identifying legal terms in Hong Kong (HK) legal texts, with the British National Corpus (BNC) as the background corpus, provide evidence for the validity and effectiveness of this approach. Without any prior knowledge or ad hoc heuristics, it achieves a precision of 97.0% on the top 1000 candidates and 96.1% on the top 10% of candidates most highly ranked by the termhood measure, demonstrating state-of-the-art performance on mono-word ATR.
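The abstract states the measure only in words; here is a minimal sketch, assuming "rank difference" means the difference of frequency ranks normalized by each corpus's vocabulary size. The exact formulation in the article may differ in details such as tie handling:

```python
# Minimal sketch of rank-difference termhood: a word scores high when it
# ranks much lower (rarer) in the background corpus than in the domain
# corpus. Normalization by vocabulary size is an assumption here.
from collections import Counter

def ranks(corpus_tokens):
    """Map each word to its frequency rank (most frequent first),
    normalized by vocabulary size to fall in (0, 1]."""
    freq = Counter(corpus_tokens)
    ordered = sorted(freq, key=freq.get, reverse=True)
    n = len(ordered)
    return {w: (i + 1) / n for i, w in enumerate(ordered)}

def termhood(word, domain_rank, background_rank):
    # Words absent from the background corpus get the worst rank, 1.0.
    return background_rank.get(word, 1.0) - domain_rank.get(word, 1.0)

domain = "plaintiff court judgment court plaintiff appeal the the of".split()
background = "the of and the a to of the and in court".split()
dr, br = ranks(domain), ranks(background)
candidates = sorted(set(domain), key=lambda w: termhood(w, dr, br), reverse=True)
print(candidates)  # domain-specific words such as 'plaintiff' come first
```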
This paper proposes an approach to enhancing dependency parsing in one language by using a treebank translated from another language. A simple statistical machine translation method, word-by-word decoding, which requires only a bilingual lexicon rather than a parallel corpus, is adopted for the treebank translation. Using an ensemble method, the key information extracted from word pairs with dependency relations in the translated text is effectively integrated into the parser for the target language. The proposed method is evaluated on English and Chinese treebanks. It is shown that a translated English treebank helps a Chinese parser achieve a state-of-the-art result.
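As a minimal sketch of the translation step, assuming each source word is mapped to a single lexicon entry while the dependency arcs are carried over unchanged; the toy lexicon and helper functions are hypothetical, not the paper's:

```python
# Minimal sketch of word-by-word treebank translation: replace each word
# via a bilingual lexicon and keep the dependency structure intact.
# The toy English-to-pinyin lexicon is hypothetical.
LEXICON = {"the": "na4", "dog": "gou3", "barks": "jiao4"}

def translate_tree(words, heads):
    """Translate words via the bilingual lexicon; heads are unchanged."""
    translated = [LEXICON.get(w.lower(), w) for w in words]
    return translated, heads

def dependency_pairs(words, heads):
    """Extract (head word, dependent word) pairs, the kind of key
    information the paper integrates into the target-language parser."""
    return [(words[h], words[d]) for d, h in enumerate(heads) if h >= 0]

words, heads = ["the", "dog", "barks"], [1, 2, -1]  # -1 marks the root
t_words, t_heads = translate_tree(words, heads)
print(dependency_pairs(t_words, t_heads))
# [('gou3', 'na4'), ('jiao4', 'gou3')]
```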