We propose a supervised method for extracting event causalities such as conduct slash-and-burn agriculture → exacerbate desertification from the web, using semantic relations between nouns, context, and association features. Experiments show that our method outperforms baselines based on state-of-the-art methods. We also propose methods for generating future scenarios such as conduct slash-and-burn agriculture → exacerbate desertification → increase Asian dust (from China) → asthma gets worse. Experiments show that we can generate 50,000 scenarios with 68% precision. We also generated the scenario deforestation continues → global warming worsens → sea temperatures rise → vibrio parahaemolyticus fouls (water), which is described in no document in our input web corpus, crawled in 2007. The vibrio risk due to global warming was, however, later reported by Baker-Austin et al. (2013). In that sense, we "predicted" this future event sequence.
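The scenario-generation step can be pictured as chaining extracted cause→effect pairs whenever an effect phrase matches the cause phrase of another pair. The sketch below is a hypothetical simplification (exact string matching, no scoring or filtering), not the paper's actual algorithm.

```python
# A minimal sketch of chaining extracted event-causality pairs into
# multi-step scenarios. The pair data and the join-on-effect rule are
# hypothetical simplifications; the paper's method also scores and
# filters candidate chains.
from collections import defaultdict

def generate_scenarios(causality_pairs, max_len=4):
    """Chain cause->effect pairs into scenarios by joining an effect
    phrase with a matching cause phrase of another pair."""
    by_cause = defaultdict(list)
    for cause, effect in causality_pairs:
        by_cause[cause].append(effect)

    scenarios = []

    def extend(chain):
        if len(chain) == max_len or chain[-1] not in by_cause:
            if len(chain) > 2:          # keep only multi-step chains
                scenarios.append(chain)
            return
        for nxt in by_cause[chain[-1]]:
            if nxt not in chain:        # avoid cycles
                extend(chain + [nxt])

    for cause, effect in causality_pairs:
        extend([cause, effect])
    return scenarios

pairs = [("conduct slash-and-burn agriculture", "exacerbate desertification"),
         ("exacerbate desertification", "increase Asian dust"),
         ("increase Asian dust", "asthma gets worse")]
for s in generate_scenarios(pairs):
    print(" -> ".join(s))
```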
The World Wide Web (WWW) allows a person to access a great amount of data provided by a wide variety of entities. However, the content varies widely in expression, which makes it difficult to browse many pages effectively even when their contents are quite similar. This study is a first step toward reducing such variety in WWW content. The method proposed in this paper enables us to easily obtain information about similar objects scattered over the WWW. We focus on the tables contained in WWW pages and propose a method to integrate them according to the category of objects presented in each table. The tables, integrated into a uniform format, enable us to easily compare objects described in different locations and styles of expression.
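As a rough illustration of such integration, the sketch below merges row-oriented tables whose attribute labels differ only superficially into one uniform table. The alias dictionary and sample tables are hypothetical, and the actual method first groups tables by the category of the objects they describe.

```python
# A minimal sketch of integrating tables that describe the same kind of
# object under different layouts. ALIASES and the sample tables are
# hypothetical stand-ins for the paper's category-based integration.
ALIASES = {"cost": "price", "fee": "price", "name": "title"}

def normalize(header):
    h = header.strip().lower()
    return ALIASES.get(h, h)

def integrate(tables):
    """Merge row-dicts from several tables into one uniform table."""
    unified = []
    for table in tables:
        for row in table:
            unified.append({normalize(k): v for k, v in row.items()})
    columns = sorted({c for row in unified for c in row})
    return columns, [[row.get(c, "") for c in columns] for row in unified]

t1 = [{"Name": "Hotel A", "Cost": "$120"}]
t2 = [{"title": "Hotel B", "fee": "$95", "rating": "4.2"}]
cols, rows = integrate([t1, t2])
print(cols)
for r in rows:
    print(r)
```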
In this paper, we present a discriminative word-character hybrid model for joint Chinese word segmentation and POS tagging. Our word-character hybrid model offers high performance since it can handle both known and unknown words. We describe strategies that yield a good balance for learning the characteristics of known and unknown words and propose an error-driven policy that delivers such balance by acquiring examples of unknown words from particular errors in a training corpus. We describe an efficient framework for training our model based on the Margin Infused Relaxed Algorithm (MIRA), evaluate our approach on the Penn Chinese Treebank, and show that it achieves superior performance compared to the state-of-the-art approaches reported in the literature.
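The core of MIRA training is a margin-based update that moves the weight vector as little as possible while making the gold analysis outscore the predicted one. The sketch below shows a 1-best version of that update with hypothetical sparse feature dicts; the paper's framework applies it to full joint segmentation/POS analyses.

```python
# A minimal sketch of a 1-best MIRA weight update, assuming sparse
# feature vectors represented as dicts. The feature names below are
# hypothetical; this only shows the core margin-based update.
def mira_update(weights, gold_feats, pred_feats, loss, C=0.01):
    """Update weights so the gold analysis outscores the prediction
    by a margin proportional to its loss, moving as little as possible."""
    diff = dict(gold_feats)                      # phi(gold) - phi(pred)
    for f, v in pred_feats.items():
        diff[f] = diff.get(f, 0.0) - v
    margin = sum(weights.get(f, 0.0) * v for f, v in diff.items())
    norm = sum(v * v for v in diff.values())
    if norm == 0.0:
        return
    tau = min(C, max(0.0, (loss - margin) / norm))  # clipped step size
    for f, v in diff.items():
        weights[f] = weights.get(f, 0.0) + tau * v

w = {}
mira_update(w, {"word=中国": 1.0, "tag=NR": 1.0},
            {"char=中": 1.0, "tag=VV": 1.0}, loss=2.0)
print(w)
```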
This paper describes an automatic acquisition method for hyponymy relations. Hyponymy relations play a crucial role in various natural language processing systems, and there have been many attempts to automatically acquire such relations from large-scale corpora. Most existing acquisition methods rely on particular linguistic patterns, such as juxtapositions, that specify hyponymy relations. Our method, however, does not use such linguistic patterns. Instead, we acquire hyponymy relations from four different types of clues. The first is repetitions of HTML tags found in ordinary HTML documents on the WWW. The second is statistical measures, such as df and idf, that are popular in the IR literature. The third is verb-noun cooccurrences found in normal corpora. The fourth is heuristic rules obtained through our experiments on a development set.
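The HTML-repetition clue can be pictured as pairing a heading with the items of the itemization that follows it, treating the heading as a hypernym candidate and the items as hyponym candidates. The sketch below is a hypothetical simplification using fixed tags and an IR-style idf score; the paper combines this clue with df/idf statistics, verb-noun cooccurrences, and heuristics.

```python
# A minimal sketch of the HTML-repetition clue. The regexes, tag choice,
# and idf scoring are hypothetical simplifications.
import re, math

def candidates_from_html(html):
    """Pair each <h2> heading with the <li> items that follow it."""
    pairs = []
    for heading, block in re.findall(
            r"<h2>(.*?)</h2>\s*<ul>(.*?)</ul>", html, re.S):
        for item in re.findall(r"<li>(.*?)</li>", block):
            pairs.append((heading.strip(), item.strip()))
    return pairs

def idf(term, doc_freq, n_docs):
    # Rarer hyponym candidates get a higher idf weight, as in IR.
    return math.log(n_docs / (1 + doc_freq.get(term, 0)))

html = "<h2>rivers</h2><ul><li>Danube</li><li>Mekong</li></ul>"
print(candidates_from_html(html))
print(idf("Danube", {"Danube": 3}, 10000))
```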
We propose a method of acquiring attribute words for a wide range of objects from Japanese Web documents. The method is a simple unsupervised method that utilizes the statistics of words, lexico-syntactic patterns, and HTML tags. To evaluate the attribute words, we also establish criteria and a procedure based on the question-answerability of the candidate word. We use C to denote both the class and its class label (the word representing the class), and A to denote both the attribute and the word representing it.
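One lexico-syntactic clue of this kind is the Japanese genitive pattern "C の A" ("A of C"), as in 車の値段 ("price of the car"). The sketch below is a hypothetical simplification that merely counts pattern matches; the regex and sample sentences are illustrative, and the actual method combines such patterns with word statistics and HTML tags.

```python
# A minimal sketch of the lexico-syntactic-pattern clue: the Japanese
# genitive pattern "C の A" suggests A as an attribute of class C.
# The regex and sample sentences are hypothetical simplifications.
import re
from collections import Counter

def attribute_candidates(cls, sentences):
    """Count words A appearing in the pattern '<cls> の A'."""
    # Capture a kanji/katakana run after the genitive particle の.
    pat = re.compile(re.escape(cls) + r"の([一-龠ァ-ンー]+)")
    counts = Counter()
    for s in sentences:
        counts.update(pat.findall(s))
    return counts.most_common()

sents = ["この車の値段は高い", "あの車の値段を調べた", "車の色が好きだ"]
print(attribute_candidates("車", sents))  # [('値段', 2), ('色', 1)]
```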