Focused crawling: a new approach to topic-specific Web resource discovery

Chakrabarti, Soumen; Berg, Martin van den; Dom, Byron

doi:10.1016/s1389-1286(99)00052-3

Cited by 1,106 publications

(719 citation statements)

References 26 publications

Supporting

Mentioning

701

Contrasting

Unclassified

Order By: Relevance

“…The primary metric which was used to evaluate the performance of the crawling system was the harvest rate P(C), which is the percentage of the web pages crawled which are related to the domain. Most focused crawlers have used this metric [6], [7], [4]. The core improvement of our focused crawler derives from combining link structure analysis and content similarity.…”

Section: Evaluation and Experimental Resultsmentioning

confidence: 99%

A Method for Focused Crawling Using Combination of Link Structure and Content Similarity

Jamali

Sayyadi

Hariri

et al. 2006

2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006 Main Conference Proceedings)(WI'06)

View full text Add to dashboard Cite

Abstract-The rapid growth of the world-wide web poses unprecedented scaling challenges for general-purpose crawlers and search engines. A focused crawler aims at selectively seek out pages that are relevant to a pre-defined set of topics. Besides specifying topics by some keywords, it is customary also to use some exemplary documents to compute the similarity of a given web document to the topic. In this paper we introduce a new hybride focused crawler, which uses link structure of documents as well as similarity of pages to the topic to crawl the web

show abstract

Section: Evaluation and Experimental Resultsmentioning

confidence: 99%

A Method for Focused Crawling Using Combination of Link Structure and Content Similarity

Jamali

Sayyadi

Hariri

et al. 2006

2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006 Main Conference Proceedings)(WI'06)

View full text Add to dashboard Cite

show abstract

“…The challenges include locating the data sources [16,17,18,19], learning and understanding the interface and the returned results so that query submission and data extraction can be automated [20,19,21,22].…”

Section: Related Workmentioning

confidence: 99%

TS-IDS Algorithm for Query Selection in the Deep Web Crawling

Wang

Lü

Chen

2014

Web Technologies and Applications

View full text Add to dashboard Cite

Abstract. The deep web crawling is the process of collecting data items inside a data source hidden behind searchable interfaces. Since the only method to access the data is by sending queries, one of the research challenges is the selection of a set of queries such that they can retrieve most of the data with minimal network traffic. This is a set covering problem that is NP-hard. The large size of the problem, in terms of both large number of documents and terms involved, calls for new approximation algorithms for efficient deep web data crawling. Inspired by the TF-IDF weighting measure in information retrieval, this paper proposes the TS-IDS algorithm that assigns an importance value to each document proportional to term size (TS), and inversely proportional to document size (IDS). The algorithm is extensively tested on a variety of datasets, and compared with the traditional greedy algorithm and the more recent IDS algorithm. We demonstrate that TS-IDS outperforms the greedy algorithm and IDS algorithm up to 33% and 29%, respectively. Our work also makes a contribution to the classic set covering problem by leveraging the long-tail distributions of the terms and documents in natural languages. Since long-tail distribution is ubiquitous in real world, our approach can be applied in areas other than the deep web crawling.

show abstract

“…A much more efficient way is to crawl intelligently to retrieve only Web pages that are likely to be about machine learning. Such focused crawling (also known as topic distillation) has been the subject of much recent research [12,71,91]. The crawler can be guided by supervised learning techniques, in form of a Web document classifier [12,91], or by reinforcement learning techniques [71].…”

Section: Exploring and Navigating The Webmentioning

confidence: 99%

“…Such focused crawling (also known as topic distillation) has been the subject of much recent research [12,71,91]. The crawler can be guided by supervised learning techniques, in form of a Web document classifier [12,91], or by reinforcement learning techniques [71]. A well-known example of a taxonomy constructed from such focused crawling was the machine learning portal at cora.justresearch.com.…”

Section: Exploring and Navigating The Webmentioning

confidence: 99%

Mining for Information Discovery on the Web: Overview and Illustrative Research

Yu¹,

Doan²

2004

Intelligent Technologies for Information Analysis

View full text Add to dashboard Cite

Summary. The Web has become a fertile ground for numerous research activities in mining. In this chapter we discuss research on finding targeted information on the Web. First, we briefly survey the research area. We focus in particular on two key issues: (a) mining to impose structures over Web data, for example by building taxonomies and portals, to aid in Web navigation, and (b) mining to build information processing systems, such as search engines, question answering systems, and data integration ones. Next, we describe two recent Web mining projects that illustrate the use of mining techniques to address the above two key issues. We conclude by briefly discussing novel research opportunities in the area of mining for information discovery on the Web.

show abstract

Focused crawling: a new approach to topic-specific Web resource discovery

Cited by 1,106 publications

References 26 publications

A Method for Focused Crawling Using Combination of Link Structure and Content Similarity

A Method for Focused Crawling Using Combination of Link Structure and Content Similarity

TS-IDS Algorithm for Query Selection in the Deep Web Crawling

Mining for Information Discovery on the Web: Overview and Illustrative Research

Contact Info

Product

Resources

About