Kleinberg's Hypertext-Induced Topic Selection (HITS) algorithm is a popular and effective algorithm for ranking web pages. One of its weaknesses is topic drift. Previous research has tried to solve this problem using anchor-related text. In this paper, we investigate the effectiveness of using Semantic Text Portions (STPs) to improve the HITS algorithm. Specifically, we examine the degree to which STPs can improve it. We also compare STPs with other kinds of anchor-related text from the viewpoint of improving the HITS algorithm. The experimental results demonstrate that, among these, STPs are the most effective for improving the HITS algorithm.
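For readers unfamiliar with HITS, the following is a minimal sketch of the standard hub/authority iteration on a link graph. It is not the STP-weighted variant studied in the paper, which the abstract does not specify. The function name `hits`, the `links` dictionary format, and the fixed iteration count are illustrative assumptions.

```python
import math


def hits(links, iterations=50):
    """Plain HITS on a link graph (illustrative sketch).

    links: dict mapping each page to the list of pages it links to.
    Returns (hub, authority) dicts with L2-normalized scores.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority update: sum the hub scores of pages linking to each page.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {p: v / norm for p, v in new_auth.items()}

        # Hub update: sum the authority scores of the pages each page links to.
        new_hub = {p: sum(auth[d] for d in links.get(p, [])) for p in pages}
        norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {p: v / norm for p, v in new_hub.items()}

    return hub, auth


# Hypothetical three-page graph: "a" and "b" both link to "c".
hub, auth = hits({"a": ["c"], "b": ["c"], "c": []})
```

In this toy graph, "c" ends up as the only authority and "a" and "b" as equally good hubs. Topic drift arises when such a graph contains densely linked pages that are off-topic. Approaches based on anchor-related text typically weight each link by how relevant the text around it is to the query, instead of counting every link equally as this sketch does.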
A large number of semistructured documents exist on the web. We can find pages that contain given keywords by using a search engine. But when we want to obtain information about an object, such as a notebook computer with 1GB of memory, we need a method that automatically extracts attribute names (in this example, "memory") and attribute values (in this example, "1GB"). Many researchers have studied the extraction of attribute values corresponding to each attribute name. This paper describes a method that extracts schemas (sets of attribute names) using a bootstrapping algorithm.
Summary: Directory services are popular among people who search for their favorite information on the Web. These services provide hierarchical categories for finding pages of interest. Pages on the Web are assigned to these categories by hand. Many existing studies classify a web page using the text in the page. Recently, some studies have used text not only from the target page to be categorized but also from the original pages that link to it. The text taken from the original pages has to be narrowed down, because those pages include many text parts that are unrelated to the target page. However, these studies apply a single extraction method to all pages, even though web pages differ greatly in format. We have already developed a method for extracting anchor-related text. We use the text parts extracted by our method to classify web pages. The experimental results show that our extraction method improves classification accuracy.
A Semantic Text Portion (STP) is a text portion in the original page that is semantically related to the anchor pointing to the target page. STPs may include facts and people's opinions about the target pages. STPs can be used in various upper-level applications such as automatic summarization and document categorization. In this paper, we concentrate on extracting STPs. We conduct a survey of STPs to determine their positions in original pages and to identify the HTML tags that separate STPs from the other text portions in those pages. We then develop a method for extracting STPs based on the results of the survey. The experimental results show that our method achieves high performance.