Chinese word segmentation is a difficult, important and widely-studied sequence modeling problem. This paper demonstrates the ability of linear-chain conditional random fields (CRFs) to perform robust and accurate Chinese word segmentation by providing a principled framework that easily supports the integration of domain knowledge in the form of multiple lexicons of characters and words. We also present a probabilistic new word detection method, which further improves performance. Our system is evaluated on four datasets used in a recent comprehensive Chinese word segmentation competition. State-of-the-art performance is obtained.
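The abstract does not give implementation details, so as a rough illustration of the general technique it describes, here is a minimal sketch of character-level feature extraction for CRF-style segmentation tagging, where lexicon membership is encoded as binary features. The B/M/E/S tag scheme, the feature names, and the toy `char_lexicon`/`word_lexicon` sets are illustrative assumptions, not the paper's actual setup.

```python
# Minimal sketch: character-level features for CRF-based Chinese word
# segmentation with lexicon features. Tag set, feature names, and the
# toy lexicons are illustrative assumptions, not the paper's configuration.

def char_features(sentence, i, char_lexicon, word_lexicon):
    """Feature dict for the i-th character of a sentence (a string)."""
    c = sentence[i]
    feats = {
        "char": c,
        "prev_char": sentence[i - 1] if i > 0 else "<BOS>",
        "next_char": sentence[i + 1] if i < len(sentence) - 1 else "<EOS>",
        # Domain knowledge expressed as a binary lexicon feature.
        "in_char_lexicon": c in char_lexicon,
    }
    # Does a lexicon word of length 2-4 start or end at this character?
    for n in (2, 3, 4):
        feats[f"starts_word_{n}"] = sentence[i:i + n] in word_lexicon
        feats[f"ends_word_{n}"] = sentence[max(0, i - n + 1):i + 1] in word_lexicon
    return feats


def sentence_features(sentence, char_lexicon, word_lexicon):
    return [char_features(sentence, i, char_lexicon, word_lexicon)
            for i in range(len(sentence))]


if __name__ == "__main__":
    # Toy lexicons; a real system would load large character and word lists.
    char_lexicon = set("北京大学")
    word_lexicon = {"北京", "大学", "北京大学"}
    X = sentence_features("北京大学", char_lexicon, word_lexicon)
    # Gold labels under a B/M/E/S scheme (B=begin, M=middle, E=end, S=single).
    y = ["B", "M", "M", "E"]
    # X and y could then be passed to a linear-chain CRF library
    # (e.g. sklearn-crfsuite: crf.fit([X], [y])).
    for feats, tag in zip(X, y):
        print(tag, feats["char"], feats["starts_word_2"])
```

A linear-chain CRF trained on such per-character feature sequences learns both emission-style weights for the lexicon features and transition weights between adjacent tags, which is what lets lexicon knowledge and sequence structure be combined in one model.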
Information extraction research at the University of Massachusetts is based on portable, trainable language processing components. Some components are more effective than others, some have been under development longer than others, but in all cases, we are working to eliminate manual knowledge engineering. Although UMass has participated in previous MUC evaluations, all of our information extraction software has been redesigned and rewritten since MUC-5, so we are evaluating a completely new system this year.

In particular, we are working with new string recognition specialists (for Named Entities), a new part-of-speech tagger, a new sentence analyzer, a new and fully automated dictionary construction algorithm, a new discourse analyzer, and a new coreference analyzer. The most interesting components in our system are CRYSTAL (which generates a concept node dictionary) [13], WRAP-UP (which establishes relational links between entities) [14, 15, 16], and RESOLVE (the coreference analyzer) [8]. Each of these components utilizes machine learning techniques in order to customize crucial extraction capabilities based on representative training texts.

Our preparations for MUC-6 began on June 19 (at the release of the Call for Participation) and ended on October 2 when we began our test runs. All of our ST-specific training began in September with the release of the ST keys. As much as we try to exploit trainable technologies, there are nevertheless places where some amount of manual coding is still needed. For example, we needed to write string manipulation functions to trim output strings in an effort to generate slot fills consistent with the MUC-6 slot fill guidelines. We also needed to create template parsers and text marking interfaces in order to map the MUC-6 training documents into data usable by our trainable components.
Topic tracking is complicated when the stories in the stream occur in multiple languages. Typically, researchers have trained only English topic models because the training stories have been provided in English. In tracking, non-English test stories are then machine translated into English to compare them with the topic models. We propose a native language hypothesis stating that comparisons would be more effective in the original language of the story. We first test and support the hypothesis for story link detection. For topic tracking the hypothesis implies that it should be preferable to build separate language-specific topic models for each language in the stream. We compare different methods of incrementally building such native language topic models.
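The abstract does not spell out which incremental model-building methods are compared. As a hedged sketch of one common centroid-based approach consistent with the native language hypothesis (an assumption here, not necessarily the paper's methods), incoming stories are routed to the topic model for their language, scored by cosine similarity, and folded into that language's centroid when they score above a threshold.

```python
# Sketch of per-language, incrementally updated topic centroids for tracking.
# The threshold, term weighting, and update rule are illustrative assumptions,
# not the specific methods compared in the paper.
from collections import Counter, defaultdict
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class NativeLanguageTracker:
    """One topic centroid per language, updated only from on-topic stories."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold
        self.centroids = defaultdict(Counter)  # language -> term counts

    def seed(self, language, tokens):
        """Initialize the topic model for a language from a training story."""
        self.centroids[language].update(tokens)

    def track(self, language, tokens):
        """Return (is_on_topic, score); update the model if on topic."""
        story = Counter(tokens)
        centroid = self.centroids.get(language)
        if not centroid:
            return False, 0.0  # no native-language model for this language yet
        score = cosine(story, centroid)
        if score >= self.threshold:
            centroid.update(story)  # incremental native-language update
        return True, score
    

if __name__ == "__main__":
    tracker = NativeLanguageTracker(threshold=0.2)
    tracker.seed("en", "earthquake relief effort coast".split())
    print(tracker.track("en", "earthquake relief teams reach coast".split()))
```

The design choice the sketch highlights is that each language keeps its own centroid, so test stories are never machine translated before comparison; how the non-seed-language models get bootstrapped is exactly the question the abstract says the paper investigates.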
Text classification poses a significant challenge for knowledge-based technologies because it touches on all the familiar demons of artificial intelligence: the knowledge engineering bottleneck, problems of scale, easy portability across multiple applications, and cost-effective system construction. Information retrieval (IR) technologies traditionally avoid all of these issues by defining a document in terms of a statistical profile of its lexical items. The IR community is willing to exploit a superficial type of knowledge found in dictionaries and thesauri, but anything that requires customization, application-specific engineering, or any amount of manual tinkering is thought to be incompatible with practical, cost-effective system designs. In this paper those assumptions are challenged, and it is shown how machine learning techniques can serve as an effective method for automated knowledge acquisition when they are applied to a representative training corpus and leveraged against a few hours of routine work by a domain expert. A fully implemented text classification system operating on a medical testbed is described, and experimental results based on that testbed are reported.