As the WWW grows at an increasing speed, classifiers targeted at hypertext are in high demand. While document categorization is a mature field, the issue of utilizing hypertext structure and hyperlinks has been relatively unexplored. In this paper, we propose a practical method for enhancing both the speed and the quality of hypertext categorization using hyperlinks. In a comparison against a recently proposed technique that appears to be the only one of its kind, we obtained up to an 18.5% improvement in effectiveness while dramatically reducing processing time. We attempt to explain through experiments which factors contribute to the improvement.
This study investigated how effectively cause-effect information can be extracted from newspaper text using a simple computational method (i.e., without knowledge-based inferencing and without full parsing of sentences). An automatic method was developed for identifying and extracting cause-effect information in Wall Street Journal text using linguistic clues and pattern matching. The set of linguistic patterns used for identifying causal relations was based on a thorough review of the literature and on an analysis of sample sentences from the Wall Street Journal. The cause-effect information extracted using the method was compared with that identified by two human judges. The program successfully extracted about 68% of the causal relations identified by both judges (the intersection of the two sets of causal relations identified by the judges). Of the instances that the computer program identified as causal relations, about 25% were identified by both judges, and 64% were identified by at least one of the judges. Problems encountered are discussed.
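The clue-based approach described above can be illustrated with a minimal sketch. The patterns below are hypothetical examples of causal linguistic clues ("because", "led to", "due to"), not the actual pattern set used in the study:

```python
import re

# Illustrative causal clue patterns; each captures a (cause, effect) pair.
# These are assumptions for demonstration, not the study's pattern set.
CAUSAL_PATTERNS = [
    # "<effect> because <cause>"
    (re.compile(r"(?P<effect>.+?)\s+because\s+(?P<cause>.+)", re.I), "because"),
    # "<cause> led to <effect>"
    (re.compile(r"(?P<cause>.+?)\s+led to\s+(?P<effect>.+)", re.I), "led to"),
    # "Due to <cause>, <effect>"
    (re.compile(r"due to\s+(?P<cause>.+?),\s*(?P<effect>.+)", re.I), "due to"),
]

def extract_causal(sentence):
    """Return (cause, effect, clue) for the first matching pattern, else None."""
    for pattern, clue in CAUSAL_PATTERNS:
        m = pattern.search(sentence)
        if m:
            return (m.group("cause").strip(" ."),
                    m.group("effect").strip(" ."),
                    clue)
    return None
```

A real system of this kind would also need to handle ambiguous clues (e.g., "since" as temporal vs. causal), which is one source of the false positives reported above.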
In traditional information retrieval (IR) systems, a document as a whole is the target for a query. With increasing interest in structured documents like SGML documents, there is a growing need to build an IR system that can retrieve parts of documents, satisfying not only content-based but also structure-based requirements. In this paper, we describe an inference-net-based approach to this problem. The model is capable of retrieving elements at any level in a principled way, satisfying certain containment constraints in a query. Moreover, while the model is general enough to reproduce the ranking strategy adopted by conventional document retrieval systems by making use of document and collection level statistics such as TF and IDF, its flexibility allows for incorporation of a variety of pragmatic and semantic information associated with document structures. We implemented the model and ran a series of experiments to show that, in addition to the added functionality, the use of the structural information embedded in SGML documents can improve the effectiveness of document retrieval, compared to the case where no such information is used. We also show that giving a pragmatic preference to a certain element type of the SGML documents can enhance retrieval effectiveness.
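The idea of ranking sub-document elements with document-level statistics such as TF and IDF can be sketched as follows. This is a simplified illustration using a plain TF-IDF sum over toy data, not the paper's inference-net model; the element identifiers and document-frequency table are assumptions:

```python
import math
from collections import Counter

def tokenize(text):
    # Naive whitespace tokenizer for illustration.
    return text.lower().split()

def score_elements(elements, query, n_docs, df):
    """Rank (element_id, text) pairs by a TF-IDF sum over the query terms.

    elements: list of (element_id, element_text) pairs (e.g., SGML sections)
    n_docs:   number of documents in the collection
    df:       document frequency for each term
    """
    scores = {}
    for eid, text in elements:
        tf = Counter(tokenize(text))
        scores[eid] = sum(
            tf[t] * math.log(n_docs / df[t])
            for t in tokenize(query) if df.get(t)
        )
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

An inference-net model generalizes this kind of scoring by propagating beliefs through a network of document, element, and term nodes, which is what allows the containment constraints and pragmatic preferences described above.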