This paper describes an automatic acquisition method for hyponymy relations. Hyponymy relations play a crucial role in various natural language processing systems, and there have been many attempts to acquire them automatically from large-scale corpora. Most existing acquisition methods rely on particular linguistic patterns, such as juxtapositions, that specify hyponymy relations. Our method, however, does not use such linguistic patterns. Instead, we try to acquire hyponymy relations from four different types of clues. The first is repetitions of HTML tags found in ordinary HTML documents on the WWW. The second is statistical measures such as df and idf, which are popular in the IR literature. The third is verb-noun co-occurrences found in ordinary corpora. The fourth is heuristic rules obtained through our experiments on a development set.
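As an illustration of the second clue, the following minimal sketch computes df and idf over a toy collection. The function names, the toy documents, and the particular idf formula (log of N over df, one common variant) are assumptions for illustration; the abstract does not say which definition the authors use or how the scores feed into acquisition.

```python
import math
from collections import Counter


def document_frequency(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set(): a term counts once per document
    return df


def inverse_document_frequency(df, n_docs):
    """idf(t) = log(N / df(t)); an assumed, commonly used variant."""
    return {term: math.log(n_docs / count) for term, count in df.items()}


# Hypothetical toy documents, already tokenized.
docs = [
    ["toyota", "honda", "car"],
    ["honda", "car", "price"],
    ["car", "review"],
]

df = document_frequency(docs)
idf = inverse_document_frequency(df, len(docs))
```

In this toy collection a term that appears in every document, such as "car", has an idf of zero, while a term confined to a single document receives the highest idf.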
We propose a variant of Convolutional Neural Network (CNN) models, the Attention CNN (ACNN), for large-scale categorization of millions of Japanese items into thirty-five product categories. Compared to a state-of-the-art Gradient Boosted Tree (GBT) classifier, the proposed model reduces training time from three weeks to three days while maintaining more than 96% accuracy. Additionally, our proposed model characterizes products by assigning attention to word tokens in a language-agnostic way. The attention words have been observed to be highly semantically correlated with the predicted categories and offer an option for automatic feature extraction in downstream processing.
Due to the explosive growth in the amount of information over the last decade, it is becoming extremely hard to obtain necessary information with conventional information access methods. Hence, drastically new technology is needed. Developing such technology requires search engine infrastructures. Although existing search engine APIs can be regarded as such infrastructures, they have several restrictions, such as a limit on the number of API calls. To help the development of new technology, we are running an open search engine infrastructure, TSUBAKI, on a high-performance computing environment. In this paper, we describe the TSUBAKI infrastructure.
This paper describes a method to acquire hyponyms for given hypernyms from HTML documents on the WWW. We assume that a heading (or explanation) of an itemization (or listing) in an HTML document is likely to contain a hypernym of the items in the itemization, and we try to acquire hyponymy relations based on this assumption. Our method extends Shinzato's method (Shinzato and Torisawa, 2004), in which a common hypernym for expressions in itemizations in HTML documents is obtained by using statistical measures. Using Japanese HTML documents, we empirically show that our proposed method can obtain a significant number of hyponymy relations that would otherwise be missed by alternative methods.
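The heading-and-itemization assumption can be sketched as follows. This is only an illustrative approximation, not the authors' method: the class name `ItemizationParser` is hypothetical, "heading" is assumed to mean the last non-empty text seen before a `<ul>` or `<ol>` list, and no statistical filtering is applied to the resulting candidates.

```python
from html.parser import HTMLParser


class ItemizationParser(HTMLParser):
    """Pair each top-level <ul>/<ol> list with the last text seen before it."""

    def __init__(self):
        super().__init__()
        self.pairs = []        # (candidate heading, [item texts])
        self._last_text = ""   # most recent non-empty text outside any list
        self._items = None     # items of the list being read, else None
        self._depth = 0        # nesting depth of <ul>/<ol>
        self._in_item = False  # inside a top-level <li>

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self._depth += 1
            if self._depth == 1:
                self._items = []
        elif tag == "li" and self._depth == 1:
            self._in_item = True
            self._items.append("")

    def handle_endtag(self, tag):
        if tag == "li" and self._depth == 1:
            self._in_item = False
        elif tag in ("ul", "ol") and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                items = [i.strip() for i in self._items if i.strip()]
                self.pairs.append((self._last_text, items))
                self._items = None

    def handle_data(self, data):
        if not data.strip():
            return
        if self._items is None:
            # Text outside a list: remember it as the candidate heading.
            self._last_text = data.strip()
        elif self._in_item:
            # Text inside an item; nested-list text is folded into it.
            self._items[-1] += data


# Hypothetical input document.
html = "<h2>Car makers</h2><ul><li>Toyota</li><li>Honda</li></ul>"
parser = ItemizationParser()
parser.feed(html)
parser.close()
pairs = parser.pairs
```

Each resulting pair holds a candidate hypernym source (the heading text) and its candidate hyponyms (the items). A real system would still need to pick the hypernym expression out of the heading and filter out unreliable pairs.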