1999
DOI: 10.1016/s1389-1286(99)00052-3

Focused crawling: a new approach to topic-specific Web resource discovery

Abstract: The rapid growth of the World-Wide Web poses unprecedented scaling challenges for general-purpose crawlers and search engines. In this paper we describe a new hypertext resource discovery system called a Focused Crawler. The goal of a focused crawler is to selectively seek out pages that are relevant to a pre-defined set of topics. The topics are specified not using keywords, but using exemplary documents. Rather than collecting and indexing all accessible Web documents to be able to answer all possible ad-hoc…

Citations: cited by 1,106 publications (719 citation statements)
References: 26 publications
“…The primary metric which was used to evaluate the performance of the crawling system was the harvest rate P(C), which is the percentage of the web pages crawled which are related to the domain. Most focused crawlers have used this metric [6], [7], [4]. The core improvement of our focused crawler derives from combining link structure analysis and content similarity.…”
Section: Evaluation and Experimental Results (mentioning)
confidence: 99%
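
The harvest rate quoted above is a simple ratio: relevant pages crawled divided by all pages crawled. A minimal sketch follows, assuming each crawled page has already been labeled relevant or not; the function name `harvest_rate` and the boolean-flag input are illustrative choices, not taken from the cited paper.

```python
from typing import Iterable


def harvest_rate(relevance_flags: Iterable[bool]) -> float:
    """Return the fraction of crawled pages judged relevant to the domain.

    `relevance_flags` holds one boolean per crawled page (hypothetical input
    format; how relevance is judged is outside the scope of this sketch).
    """
    flags = list(relevance_flags)
    if not flags:
        # No pages crawled yet: define the rate as 0 to avoid dividing by zero.
        return 0.0
    return sum(flags) / len(flags)


if __name__ == "__main__":
    crawled = [True, True, False, True]  # 3 of 4 pages are on topic
    print(f"harvest rate: {harvest_rate(crawled):.0%}")
```

Multiplying the result by 100 gives the percentage form used in the statement.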
“…The challenges include locating the data sources [16,17,18,19], learning and understanding the interface and the returned results so that query submission and data extraction can be automated [20,19,21,22].…”
Section: Related Work (mentioning)
confidence: 99%
“…A much more efficient way is to crawl intelligently to retrieve only Web pages that are likely to be about machine learning. Such focused crawling (also known as topic distillation) has been the subject of much recent research [12,71,91]. The crawler can be guided by supervised learning techniques, in form of a Web document classifier [12,91], or by reinforcement learning techniques [71].…”
Section: Exploring and Navigating The Web (mentioning)
confidence: 99%
“…Such focused crawling (also known as topic distillation) has been the subject of much recent research [12,71,91]. The crawler can be guided by supervised learning techniques, in form of a Web document classifier [12,91], or by reinforcement learning techniques [71]. A well-known example of a taxonomy constructed from such focused crawling was the machine learning portal at cora.justresearch.com.…”
Section: Exploring and Navigating The Web (mentioning)
confidence: 99%
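
The classifier-guided crawling described in these statements can be illustrated with a best-first frontier. The sketch below is a toy under stated assumptions, not the method of any cited paper: the link graph `TOY_LINKS` and the scores in `TOY_SCORES` are invented stand-ins for the live Web and for a trained document classifier, and each outlink simply inherits the relevance score of the page that links to it.

```python
import heapq
from typing import Callable, Dict, List

# Hypothetical in-memory link graph standing in for the Web.
TOY_LINKS: Dict[str, List[str]] = {
    "seed": ["ml-intro", "sports"],
    "ml-intro": ["ml-papers"],
    "sports": ["football"],
    "ml-papers": [],
    "football": [],
}

# Hypothetical relevance scores standing in for a document classifier.
TOY_SCORES: Dict[str, float] = {
    "seed": 0.5, "ml-intro": 0.9, "sports": 0.1,
    "ml-papers": 0.8, "football": 0.0,
}


def focused_crawl(seeds: List[str],
                  links: Dict[str, List[str]],
                  relevance: Callable[[str], float],
                  budget: int) -> List[str]:
    """Visit up to `budget` pages, expanding the most promising links first."""
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited: List[str] = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        score = relevance(url)  # "classify" the fetched page
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                # Outlinks inherit the relevance of the page that cites them.
                heapq.heappush(frontier, (-score, target))
    return visited


if __name__ == "__main__":
    print(focused_crawl(["seed"], TOY_LINKS, TOY_SCORES.__getitem__, budget=4))
```

With a budget of four pages, the crawl follows the on-topic branch before the off-topic one and never reaches "football". A reinforcement-learning crawler, as mentioned in the statements, would replace the inherited score with a learned estimate of future reward.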