Inverse Document Frequency (IDF) is a widely accepted term weighting scheme whose robustness is supported by many theoretical justifications. However, applying IDF to word N-grams (or simply N-grams) of arbitrary length without relying on heuristics has remained a challenging issue. This article describes a theoretical extension of IDF that handles N-grams. First, we elucidate the theoretical relationship between IDF and information distance, a universal metric defined by Kolmogorov complexity. Building on this relationship, we propose N-gram IDF, a new IDF family that assigns fair weights to words and phrases of any length. Using only the magnitude relation of N-gram IDF weights, dominant N-grams among overlapping N-grams can be determined. We also propose an efficient method for computing the N-gram IDF weights of all N-grams by leveraging the enhanced suffix array and wavelet tree. Because the exact computation of N-gram IDF provably requires significant computational cost, we further develop a fast approximation method that estimates weight errors analytically while maintaining application-level performance. Empirical evaluations on unsupervised and supervised key term extraction and on web search query segmentation, under various experimental settings, demonstrate the robustness and language independence of the proposed N-gram IDF.
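As a point of reference for the weighting discussed above, the sketch below shows only the classical IDF weight log(N/df) computed for word N-grams by brute-force document-frequency counting; it does not reproduce the paper's information-distance-based N-gram IDF formula or its enhanced suffix array and wavelet tree computation, and the toy corpus is invented for illustration.

```python
# Minimal illustration (not the paper's algorithm): brute-force document
# frequency and the classical IDF weight log(N / df) for word N-grams.
import math
from collections import defaultdict

def ngrams(tokens, n):
    """Return all contiguous word N-grams of length n as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def classic_idf_for_ngrams(corpus, max_n=3):
    """corpus: list of token lists. Returns {ngram: log(N / df)}."""
    num_docs = len(corpus)
    df = defaultdict(int)
    for tokens in corpus:
        seen = set()
        for n in range(1, max_n + 1):
            seen.update(ngrams(tokens, n))
        for g in seen:                      # count each N-gram once per document
            df[g] += 1
    return {g: math.log(num_docs / d) for g, d in df.items()}

# Toy corpus: "new york" occurs in 2 of 3 documents, so its weight is log(3/2).
corpus = [
    "new york city is large".split(),
    "new york has many parks".split(),
    "the city has parks".split(),
]
weights = classic_idf_for_ngrams(corpus)
print(weights[("new", "york")])
```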
This paper proposes a Wikipedia-based semantic similarity measurement method intended for real-world noisy short texts. Our method is a variant of Explicit Semantic Analysis (ESA): it attaches a bag of Wikipedia entities (Wikipedia pages) to a text as its semantic representation and uses the vector of entities to compute semantic similarity. Adding related entities to a text, rather than to a single word or phrase, is a challenging practical problem because it usually comprises several subproblems, e.g., extracting key terms from the text, finding related entities for each key term, and aggregating the weights of the related entities. Our proposed method solves this aggregation problem using extended Naive Bayes (ENB), a probabilistic weighting mechanism based on Bayes' theorem. The method is especially effective when the short text is semantically noisy, i.e., when it contains meaningless or misleading terms for estimating its main topic. Experimental results on Twitter message and Web snippet clustering revealed that our method outperformed ESA on noisy short texts. We also found that reducing the dimension of the vector to representative Wikipedia entities scarcely affected performance while shrinking the vector size, and hence the storage space and the processing time of computing the cosine similarity.
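To make the final comparison step concrete, here is a minimal sketch of computing cosine similarity over bag-of-Wikipedia-entities vectors, as the abstract describes. The entity names and weights are hypothetical; in the paper the weights would come from the ENB aggregation step, which is not reproduced here.

```python
# Hedged sketch: cosine similarity of two sparse entity vectors
# represented as {entity: weight} dicts.
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors (dicts of entity -> weight)."""
    dot = sum(w * v.get(e, 0.0) for e, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical semantic representations of two short texts.
text_a = {"Apple_Inc.": 0.9, "IPhone": 0.6, "Smartphone": 0.3}
text_b = {"Smartphone": 0.8, "Android_(operating_system)": 0.7}
print(round(cosine(text_a, text_b), 3))
```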
This paper describes a novel probabilistic method of measuring semantic similarity for real-world noisy short texts such as microblog posts. Our method adds related Wikipedia entities to a short text as its semantic representation and uses the vector of entities to compute semantic similarity. Adding related entities to a text is generally a compound problem that involves extracting key terms, finding related entities for each key term, and aggregating the related entities. Explicit Semantic Analysis (ESA), a popular Wikipedia-based method, solves these problems by summing the weighted vectors of related entities. However, this heuristic weighting depends heavily on majority decision among terms and is not suited to short texts that contain few key terms but many noisy terms. The proposed probabilistic method synthesizes these procedures by extending naive Bayes and achieves robust estimates of related Wikipedia entities for short texts. Experimental results on short text clustering using Twitter data indicated that our method outperformed ESA for short texts containing noisy terms.
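For intuition about the kind of aggregation naive Bayes provides, the sketch below ranks candidate entities for a set of key terms via P(e | t1..tn) ∝ P(e) · Π P(ti | e), computed in log space. This is plain naive Bayes, not the authors' extended variant, and the priors and likelihoods are invented for illustration.

```python
# Hedged sketch (plain naive Bayes, not the paper's ENB): score each
# candidate entity e for key terms t1..tn as log P(e) + sum_i log P(ti | e).
import math

def rank_entities(terms, prior, likelihood, epsilon=1e-6):
    """prior: {entity: P(e)}; likelihood: {entity: {term: P(t|e)}}."""
    scores = {}
    for e, p_e in prior.items():
        log_score = math.log(p_e)
        for t in terms:
            # epsilon stands in for unseen term-entity pairs (smoothing).
            log_score += math.log(likelihood[e].get(t, epsilon))
        scores[e] = log_score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical statistics for two candidate entities.
prior = {"Apple_Inc.": 0.01, "Apple": 0.02}
likelihood = {
    "Apple_Inc.": {"apple": 0.3, "iphone": 0.4},
    "Apple":      {"apple": 0.5, "fruit": 0.3},
}
print(rank_entities(["apple", "iphone"], prior, likelihood))
```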
The value of machine-readable taxonomies has been demonstrated by various applications such as document classification and information retrieval. One of the main approaches in automated taxonomy extraction research is Web mining based on statistical NLP, and a significant number of studies have been conducted. However, existing work on automatic dictionary building suffers from accuracy problems due to the technical limitations of statistical NLP (Natural Language Processing) and noisy data on the WWW. To address these problems, in this work we focus on mining Wikipedia, a large-scale Web encyclopedia. Wikipedia has high-quality, large-scale articles and a category system because many users around the world edit and refine the articles and the category system daily. By using Wikipedia, the loss of accuracy stemming from NLP can be avoided. However, affiliation relations cannot be extracted simply by descending the category system automatically, since the category system in Wikipedia is not a tree structure but a network structure. We propose concept vectorization methods that are applicable to the network-structured category system of Wikipedia.
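To illustrate why the network structure matters, here is a minimal sketch of one simple way to build a concept vector by traversing the category graph: a breadth-first walk with depth-decayed weights and a visited set that prevents loops and double counting. This is not necessarily the authors' vectorization method, and the category fragment is hypothetical.

```python
# Hedged sketch: concept vector from a BFS over the category network,
# decaying weights with depth and visiting each category at most once.
from collections import deque

def concept_vector(start_categories, parents, max_depth=3, decay=0.5):
    """parents: {category: [parent categories]} - the category network."""
    weights = {}
    queue = deque((c, 0) for c in start_categories)
    visited = set(start_categories)
    while queue:
        cat, depth = queue.popleft()
        weights[cat] = weights.get(cat, 0.0) + decay ** depth
        if depth < max_depth:
            for parent in parents.get(cat, []):
                if parent not in visited:       # guard against cycles
                    visited.add(parent)
                    queue.append((parent, depth + 1))
    return weights

# Tiny hypothetical fragment of the category network.
parents = {
    "Smartphones": ["Mobile_phones", "Consumer_electronics"],
    "Mobile_phones": ["Telecommunications"],
    "Consumer_electronics": ["Electronics"],
}
print(concept_vector(["Smartphones"], parents))
```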