Keyphrases are widely used as brief summaries of documents. Since manual assignment is time-consuming, various unsupervised ranking methods based on importance scores have been proposed for keyphrase extraction. In practice, the keyphrases of a document should not only be statistically important in the document but also provide good coverage of it. Based on this observation, we propose an unsupervised method for keyphrase extraction. First, the method finds exemplar terms by leveraging clustering techniques, which guarantees that the document is semantically covered by these exemplar terms. Then the keyphrases are extracted from the document using the exemplar terms. Our method outperforms state-of-the-art graph-based ranking methods (TextRank) by 9.5% in F1-measure.
This paper proposes an automatic scheme to extract Chinese abbreviations and their corresponding definitions from large-scale anchor texts. The method is motivated by the observation that the more frequently two anchor texts point to the same web page, the more related they are. Since abbreviation-definition pairs are highly related, they can be extracted from these related words. Our method involves three steps. First, we use external statistical features to extract candidate abbreviation-definition pairs from anchor texts. Second, we extract internal features from the candidate pairs and adopt Conditional Random Fields (CRFs) to compute a score for each candidate pair. Finally, we combine external and internal features to generate the final pairs. Experimental results show that this method can accurately extract Chinese abbreviation-definition pairs from anchor texts, and that combining external and internal features is effective for extracting such pairs.
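The motivating observation (anchor texts that point to the same page tend to be related) can be illustrated with a small counting sketch. This is not the paper's feature set: the function name `co_pointing_counts`, the input format, and the choice to count distinct shared target pages are all assumptions made for illustration, and the URLs are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def co_pointing_counts(links):
    """Count, for each pair of anchor texts, how many distinct pages both point to.

    `links` is an iterable of (anchor_text, target_url) pairs (assumed format).
    Returns a dict mapping frozenset({anchor_a, anchor_b}) -> shared page count.
    """
    # Group the anchor texts by the page they point to.
    anchors_by_url = defaultdict(set)
    for anchor, url in links:
        anchors_by_url[url].add(anchor)

    # Every pair of anchors that shares a page gets one count for that page.
    counts = defaultdict(int)
    for anchors in anchors_by_url.values():
        for a, b in combinations(sorted(anchors), 2):
            counts[frozenset((a, b))] += 1
    return dict(counts)


# Hypothetical toy data: an abbreviation and its full form share two target pages.
links = [
    ("北大", "http://example.org/pku"),
    ("北京大学", "http://example.org/pku"),
    ("北大", "http://example.org/pku/about"),
    ("北京大学", "http://example.org/pku/about"),
    ("清华大学", "http://example.org/thu"),
]
counts = co_pointing_counts(links)
print(counts[frozenset(("北大", "北京大学"))])
```

Pairs with a high count would be candidate abbreviation-definition pairs in this simplified view; the scoring with CRFs and the combination with internal features described in the abstract are not covered here.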
User behavior analysis and collaborative filtering have attracted a large body of research in the machine learning community. The goal is either to enhance the user experience or to discover useful information hidden in the data. In this article, we conduct extensive experiments on a Chinese input method data set that records the word lists users have used. From the collaborative perspective, we then address two tasks in natural language processing: related word retrieval and new word detection. Motivated by the observation that two words are usually highly related if they co-occur frequently in users' records, we propose a novel semantic relatedness measure between words that takes both user behaviors and collaborative filtering into consideration. We use this measure to perform the related word retrieval and new word detection tasks. Experimental results on both tasks indicate the applicability and effectiveness of our method.
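The abstract does not give the relatedness measure itself. As a rough illustration of the underlying idea (words that co-occur in many users' records are related), the sketch below uses plain Jaccard similarity over the sets of users who have used each word. Jaccard is a stand-in chosen here, not the measure proposed in the article, and the names `users_by_word` and `jaccard_relatedness` and the record format are assumptions.

```python
from collections import defaultdict


def users_by_word(records):
    """Invert {user: [words]} records into {word: set of users who used it}."""
    index = defaultdict(set)
    for user, words in records.items():
        for word in words:
            index[word].add(user)
    return index


def jaccard_relatedness(index, w1, w2):
    """Share of users who used both words among users who used either word."""
    a = index.get(w1, set())
    b = index.get(w2, set())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical user word lists.
records = {
    "u1": ["python", "pandas"],
    "u2": ["python", "pandas", "numpy"],
    "u3": ["python", "guitar"],
}
index = users_by_word(records)
print(jaccard_relatedness(index, "python", "pandas"))
print(jaccard_relatedness(index, "pandas", "guitar"))
```

Ranking all words by this score against a query word gives a simple form of related word retrieval under these assumptions.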
Traditional text classification methods make a basic assumption: the training and test sets are homologous. This naïve assumption may not hold in the real world, especially in the web environment. Documents on the web change over time, so a pre-trained model may be out of date when applied to newly emerging documents. However, some information in the training set is nonetheless useful. In this paper, we propose a novel method that uses transfer learning to discover the constant common knowledge shared by the training and test sets, and then builds a model based on this knowledge to fit the distribution of the test set. The model is reinforced iteratively by adding the most confident instances from the unlabeled test set to the training set until convergence, which is a self-training process. A preliminary experiment shows that our method achieves an improvement of approximately 8.92% over the standard supervised-learning method.
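The self-training loop described above can be sketched as follows. This is only an illustration of the iterative step (add the most confident unlabeled test instances to the training set and refit), not the paper's model: the nearest-centroid classifier on 2-D points, the distance-margin confidence, and the names `fit_centroids`, `predict`, and `self_train` are all assumptions, and the transfer-learning step that finds common knowledge is not covered. The loop here stops when the unlabeled pool is empty or after `max_iter` rounds, which stands in for the convergence test.

```python
import math


def fit_centroids(points, labels):
    """Compute the mean point of each class."""
    sums, counts = {}, {}
    for (x, y), lab in zip(points, labels):
        sx, sy = sums.get(lab, (0.0, 0.0))
        sums[lab] = (sx + x, sy + y)
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: (sx / counts[lab], sy / counts[lab]) for lab, (sx, sy) in sums.items()}


def predict(cents, point):
    """Return (label, margin); margin is the distance gap to the runner-up class."""
    dists = sorted((math.dist(point, c), lab) for lab, c in cents.items())
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else float("inf")
    return dists[0][1], margin


def self_train(train_pts, train_labels, test_pts, batch=1, max_iter=100):
    """Repeatedly move the most confident test points into the training set."""
    pts, labs = list(train_pts), list(train_labels)
    remaining = list(test_pts)
    for _ in range(max_iter):
        if not remaining:
            break
        cents = fit_centroids(pts, labs)
        # Score every unlabeled point and take the highest-margin ones first.
        scored = sorted(((predict(cents, p), p) for p in remaining),
                        key=lambda item: -item[0][1])
        for (lab, _margin), p in scored[:batch]:
            pts.append(p)
            labs.append(lab)
            remaining.remove(p)
    return fit_centroids(pts, labs)


# Hypothetical data: the test points are shifted relative to the training points.
train = [(0, 0), (1, 0), (5, 5), (6, 5)]
labels = ["a", "a", "b", "b"]
test = [(2, 1), (3, 2), (7, 7), (8, 8)]
model = self_train(train, labels, test)
print(predict(model, (7, 7))[0], predict(model, (2, 1))[0])
```

Adding only a small batch per round lets the centroids drift gradually toward the test distribution, so borderline points are labeled by a model that has already adapted to the easier ones.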