Inverse Document Frequency (IDF) is a widely accepted term weighting scheme whose robustness is supported by many theoretical justifications. However, applying IDF to word N-grams (or simply N-grams) of arbitrary length without relying on heuristics has remained a challenging issue. This article describes a theoretical extension of IDF that handles N-grams. First, we elucidate the theoretical relationship between IDF and information distance, a universal metric defined by Kolmogorov complexity. Building on this relationship, we propose N-gram IDF, a new IDF family that assigns fair weights to words and phrases of any length. Using only the magnitude relation of N-gram IDF weights, dominant N-grams among overlapping N-grams can be determined. We also propose an efficient method for computing the N-gram IDF weights of all N-grams by leveraging the enhanced suffix array and wavelet tree. Because the exact computation of N-gram IDF provably requires significant computational cost, we further develop a fast approximation method that estimates weight errors analytically while maintaining application-level performance. Empirical evaluations on unsupervised and supervised key term extraction and on web search query segmentation, under various experimental settings, demonstrate the robustness and language independence of the proposed N-gram IDF.
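As a point of reference for the weighting discussed above, the sketch below shows only the classical IDF weight log(N/df) computed for word N-grams by brute-force document-frequency counting; it does not reproduce the paper's information-distance-based N-gram IDF formula or its enhanced suffix array and wavelet tree computation, and the toy corpus is invented for illustration.

```python
# Minimal illustration (not the paper's algorithm): brute-force document
# frequency and the classical IDF weight log(N / df) for word N-grams.
import math
from collections import defaultdict

def ngrams(tokens, n):
    """Return all contiguous word N-grams of length n as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def classic_idf_for_ngrams(corpus, max_n=3):
    """corpus: list of token lists. Returns {ngram: log(N / df)}."""
    num_docs = len(corpus)
    df = defaultdict(int)
    for tokens in corpus:
        seen = set()
        for n in range(1, max_n + 1):
            seen.update(ngrams(tokens, n))
        for g in seen:                      # count each N-gram once per document
            df[g] += 1
    return {g: math.log(num_docs / d) for g, d in df.items()}

# Toy corpus: "new york" occurs in 2 of 3 documents, so its weight is log(3/2).
corpus = [
    "new york city is large".split(),
    "new york has many parks".split(),
    "the city has parks".split(),
]
weights = classic_idf_for_ngrams(corpus)
print(weights[("new", "york")])
```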
This paper proposes a Wikipedia-based semantic similarity measurement method intended for real-world noisy short texts. Our method is a variant of Explicit Semantic Analysis (ESA): it attaches a bag of Wikipedia entities (Wikipedia pages) to a text as its semantic representation and uses the vector of entities to compute semantic similarity. Adding related entities to a text, rather than to a single word or phrase, is a challenging practical problem because it usually comprises several subproblems, e.g., extracting key terms from the text, finding related entities for each key term, and aggregating the weights of the related entities. Our proposed method solves this aggregation problem using extended Naive Bayes (ENB), a probabilistic weighting mechanism based on Bayes' theorem. The method is especially effective when the short text is semantically noisy, i.e., when it contains meaningless or misleading terms for estimating its main topic. Experimental results on Twitter message and Web snippet clustering revealed that our method outperformed ESA on noisy short texts. We also found that reducing the dimension of the vector to representative Wikipedia entities scarcely affected performance while shrinking the vector size, and hence the storage space and the processing time of computing the cosine similarity.
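To make the final comparison step concrete, here is a minimal sketch of computing cosine similarity over bag-of-Wikipedia-entities vectors, as the abstract describes. The entity names and weights are hypothetical; in the paper the weights would come from the ENB aggregation step, which is not reproduced here.

```python
# Hedged sketch: cosine similarity of two sparse entity vectors
# represented as {entity: weight} dicts.
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors (dicts of entity -> weight)."""
    dot = sum(w * v.get(e, 0.0) for e, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical semantic representations of two short texts.
text_a = {"Apple_Inc.": 0.9, "IPhone": 0.6, "Smartphone": 0.3}
text_b = {"Smartphone": 0.8, "Android_(operating_system)": 0.7}
print(round(cosine(text_a, text_b), 3))
```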
This paper describes a novel probabilistic method of measuring semantic similarity for real-world noisy short texts such as microblog posts. Our method adds related Wikipedia entities to a short text as its semantic representation and uses the vector of entities to compute semantic similarity. Adding related entities to a text is generally a compound problem that involves extracting key terms, finding related entities for each key term, and aggregating the related entities. Explicit Semantic Analysis (ESA), a popular Wikipedia-based method, solves these problems by summing the weighted vectors of related entities. However, this heuristic weighting depends heavily on majority decision among terms and is not suited to short texts that contain few key terms but many noisy terms. The proposed probabilistic method synthesizes these procedures by extending naive Bayes and achieves robust estimates of related Wikipedia entities for short texts. Experimental results on short text clustering using Twitter data indicated that our method outperformed ESA for short texts containing noisy terms.
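For intuition about the kind of aggregation naive Bayes provides, the sketch below ranks candidate entities for a set of key terms via P(e | t1..tn) ∝ P(e) · Π P(ti | e), computed in log space. This is plain naive Bayes, not the authors' extended variant, and the priors and likelihoods are invented for illustration.

```python
# Hedged sketch (plain naive Bayes, not the paper's ENB): score each
# candidate entity e for key terms t1..tn as log P(e) + sum_i log P(ti | e).
import math

def rank_entities(terms, prior, likelihood, epsilon=1e-6):
    """prior: {entity: P(e)}; likelihood: {entity: {term: P(t|e)}}."""
    scores = {}
    for e, p_e in prior.items():
        log_score = math.log(p_e)
        for t in terms:
            # epsilon stands in for unseen term-entity pairs (smoothing).
            log_score += math.log(likelihood[e].get(t, epsilon))
        scores[e] = log_score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical statistics for two candidate entities.
prior = {"Apple_Inc.": 0.01, "Apple": 0.02}
likelihood = {
    "Apple_Inc.": {"apple": 0.3, "iphone": 0.4},
    "Apple":      {"apple": 0.5, "fruit": 0.3},
}
print(rank_entities(["apple", "iphone"], prior, likelihood))
```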
The value of machine-readable taxonomies has been demonstrated by various applications such as document classification and information retrieval. One of the main approaches in automated taxonomy extraction research is Web mining based on statistical NLP, and a significant number of studies have been conducted. However, existing work on automatic dictionary building suffers from accuracy problems due to the technical limitations of statistical NLP (Natural Language Processing) and noisy data on the WWW. To address these problems, in this work we focus on mining Wikipedia, a large-scale Web encyclopedia. Wikipedia has high-quality, large-scale articles and a category system because many users around the world edit and refine the articles and the category system daily. By using Wikipedia, the loss of accuracy stemming from NLP can be avoided. However, affiliation relations cannot be extracted simply by descending the category system automatically, since the category system in Wikipedia is not a tree structure but a network structure. We propose concept vectorization methods that are applicable to the network-structured category system of Wikipedia.
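To illustrate why the network structure matters, here is a minimal sketch of one simple way to build a concept vector by traversing the category graph: a breadth-first walk with depth-decayed weights and a visited set that prevents loops and double counting. This is not necessarily the authors' vectorization method, and the category fragment is hypothetical.

```python
# Hedged sketch: concept vector from a BFS over the category network,
# decaying weights with depth and visiting each category at most once.
from collections import deque

def concept_vector(start_categories, parents, max_depth=3, decay=0.5):
    """parents: {category: [parent categories]} - the category network."""
    weights = {}
    queue = deque((c, 0) for c in start_categories)
    visited = set(start_categories)
    while queue:
        cat, depth = queue.popleft()
        weights[cat] = weights.get(cat, 0.0) + decay ** depth
        if depth < max_depth:
            for parent in parents.get(cat, []):
                if parent not in visited:       # guard against cycles
                    visited.add(parent)
                    queue.append((parent, depth + 1))
    return weights

# Tiny hypothetical fragment of the category network.
parents = {
    "Smartphones": ["Mobile_phones", "Consumer_electronics"],
    "Mobile_phones": ["Telecommunications"],
    "Consumer_electronics": ["Electronics"],
}
print(concept_vector(["Smartphones"], parents))
```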