Description: Estimation and inference methods for models of conditional quantiles: linear and nonlinear parametric and non-parametric (total variation penalized) models for conditional quantiles of a univariate response, and several methods for handling censored survival data. Portfolio selection methods based on expected shortfall risk are also included.
Version: 5.35
Maintainer: Roger Koenker
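As a rough illustration of what "models of conditional quantiles" optimize, the sketch below uses the check (pinball) loss for the simplest case, an intercept-only model, where the minimizer is a sample quantile. It is written in plain Python and is not the package's own code; the function names and the toy data are assumptions made for this example.

```python
def check_loss(residual, tau):
    """Check (pinball) loss: rho_tau(u) = u * (tau - 1[u < 0])."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def total_loss(y, q, tau):
    """Sum of check losses when every observation is predicted by the constant q."""
    return sum(check_loss(value - q, tau) for value in y)


def sample_quantile(y, tau):
    """Intercept-only quantile fit.

    The objective is piecewise linear and convex with kinks at the data
    points, so a minimizer can always be found among the observations.
    """
    return min(sorted(y), key=lambda q: total_loss(y, q, tau))


# Hypothetical toy data, not taken from the package.
y = [2.0, 7.0, 1.0, 9.0, 4.0]
median = sample_quantile(y, 0.5)
upper = sample_quantile(y, 0.9)
print(median, upper)
```

With covariates, the constant `q` is replaced by a linear predictor and the same loss is minimized over its coefficients, typically by linear programming rather than by the exhaustive search used here.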
This paper presents a Chinese word segmentation system which can adapt to different domains and standards. We first present a statistical framework where domain-specific words are identified in a unified approach to word segmentation based on linear models. We explore several features and describe how to create training data by sampling. We then describe a transformation-based learning method used to adapt our system to different word segmentation standards. Evaluation of the proposed system on five test sets with different standards shows that the system achieves state-of-the-art performance on all of them.

Introduction

Chinese word segmentation has been a longstanding research topic in Chinese language processing. Recent developments in this field show that, in addition to ambiguity resolution and unknown word detection, the usefulness of a Chinese word segmenter also depends crucially on its ability to adapt to different domains of text and different segmentation standards. The need for adaptation involves two research issues that we address in this paper.

The first is new word detection. Different domains and applications may have different vocabularies containing new words or terms that are not available in a general dictionary. In this paper, new words refer to OOV words other than named entities, factoids and morphologically derived words. These words are mostly domain-specific terms (e.g. 蜂窝式 'cellular') and time-sensitive political, social or cultural terms (e.g. 三通 'Three Links', 非典 'SARS').

The second issue concerns the customizable display of word segmentation. Different Chinese NLP-enabled applications may have different requirements that call for different granularities of word segmentation. For example, speech recognition systems prefer "longer words" to achieve higher accuracy, whereas information retrieval systems prefer "shorter words" to obtain higher recall rates (Wu, 2003). Given a word segmentation specification (or standard) and/or some application data used as training data, a segmenter with customizable display should be able to provide alternative segmentation units according to the specification, which is either pre-defined or implied in the data.

In this paper, we first present a statistical framework for Chinese word segmentation, where various problems of word segmentation are solved simultaneously in a unified approach. Our approach is based on linear models where component models are inspired by the source-channel models of Chinese sentence generation. We then describe in detail how the new word identification (NWI) problem is handled in this framework. We explore several features and describe how to create training data by sampling. We evaluate the performance of our segmentation system using an annotated test set, where new words are simulated by sampling. We then describe a transformation-based learning (TBL; Brill, 1995) method that is used to adapt our system to different segmentation standards. We compare the adaptive system to other state-of-the-art systems using...
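To make the idea of segmentation with a linear model concrete, the following is a minimal sketch that scores each candidate word as a weighted sum of features and finds the highest-scoring segmentation by dynamic programming. It is not the system described above: the lexicon, the feature names, the weights and the example sentence are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy lexicon with log-probability-like scores (an assumption).
LEXICON = {"研究": -2.0, "生命": -2.5, "研究生": -3.0,
           "命": -4.0, "起源": -2.5, "的": -1.0}

# Hypothetical feature weights of the linear model (an assumption).
WEIGHTS = {"lexicon_score": 1.0, "oov_penalty": 1.0, "word_count": -0.5}


def word_features(word):
    """Feature vector for one candidate word."""
    if word in LEXICON:
        return {"lexicon_score": LEXICON[word], "word_count": 1.0}
    # Out-of-vocabulary strings get a penalty that grows with their length.
    return {"oov_penalty": -10.0 * len(word), "word_count": 1.0}


def word_score(word):
    """Linear model: weighted sum of the word's features."""
    return sum(WEIGHTS[name] * value
               for name, value in word_features(word).items())


def segment(sentence, max_len=4):
    """Return the segmentation with the highest total score."""
    n = len(sentence)
    best_score = [0.0] + [-math.inf] * n  # best score of each prefix
    back = [0] * (n + 1)                  # start index of the last word
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            candidate = best_score[start] + word_score(sentence[start:end])
            if candidate > best_score[end]:
                best_score[end] = candidate
                back[end] = start
    # Follow the back-pointers from the end of the sentence.
    words = []
    end = n
    while end > 0:
        words.append(sentence[back[end]:end])
        end = back[end]
    return words[::-1]


print(segment("研究生命的起源"))
```

In this toy setting the analysis 研究 / 生命 / 的 / 起源 outscores the competing 研究生 / 命 / 的 / 起源. Features for new word identification would enter the same weighted sum in place of the flat out-of-vocabulary penalty used here.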
This letter presents a new discriminative model for Information Retrieval (IR), referred to as the Ordinal Regression Model (ORM). ORM differs from most existing models in that it views IR as an ordinal regression problem (i.e. a ranking problem) instead of a binary classification problem. Since the task of IR is to rank documents according to the user's information need, IR can naturally be viewed as an ordinal regression problem. Two parameter learning algorithms for ORM are presented: a perceptron-based algorithm and the ranking Support Vector Machine (SVM). The effectiveness of the proposed approach has been evaluated on the task of ad hoc retrieval using three English Text REtrieval Conference (TREC) sets and two Chinese TREC sets. Results show that ORM significantly outperforms the state-of-the-art language model approaches and the OKAPI system on all test sets, and that it is more appropriate to view IR as ordinal regression rather than binary classification.
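The perceptron-based view of ranking can be sketched as follows: a weight vector is updated whenever a more relevant document fails to score above a less relevant one. This is a generic pairwise ranking perceptron written for illustration, not the letter's exact algorithm; the two-dimensional feature vectors, document names and hyperparameters are hypothetical.

```python
def dot(w, x):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(w, x))


def train_ranking_perceptron(pairs, dim, epochs=10, lr=1.0):
    """Learn w such that dot(w, better) > dot(w, worse) for each pair.

    `pairs` holds (better, worse) feature vectors for the same query.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        mistakes = 0
        for better, worse in pairs:
            diff = [b - c for b, c in zip(better, worse)]
            if dot(w, diff) <= 0:  # pair is ordered wrongly (or tied)
                w = [wi + lr * d for wi, d in zip(w, diff)]
                mistakes += 1
        if mistakes == 0:  # every training pair is ranked correctly
            break
    return w


def rank(docs, w):
    """Sort document ids by descending model score."""
    return sorted(docs, key=lambda doc_id: dot(w, docs[doc_id]), reverse=True)


# Hypothetical preference pairs: (more relevant, less relevant).
pairs = [([3.0, 1.0], [1.0, 2.0]),
         ([2.0, 0.0], [1.0, 1.0]),
         ([4.0, 2.0], [2.0, 3.0])]
w = train_ranking_perceptron(pairs, dim=2)

docs = {"d1": [3.0, 1.0], "d2": [1.0, 2.0], "d3": [2.0, 0.0]}
ranking = rank(docs, w)
print(w, ranking)
```

Training on preference pairs rather than on relevant/non-relevant labels is what distinguishes the ranking formulation from binary classification; a ranking SVM optimizes a margin-based objective over the same kind of pairs.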