2015
DOI: 10.1007/s11280-015-0331-7

Finding seeds to bootstrap focused crawlers

Cited by 22 publications (21 citation statements). References 21 publications.
“…al. [31] proposed a system that uses relevance feedback to gather seeds to bootstrap focused crawlers. It submits keyword search queries to Bing; extracts keywords from the result pages classified as relevant for the focus domain; and uses these keywords to construct new search queries.…”
Section: Search-based Discovery (mentioning)
confidence: 99%
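As a concrete illustration of the relevance-feedback loop described in the statement above, the sketch below submits a query, keeps the hits judged relevant as seed candidates, mines keywords from them, and feeds those keywords back as new queries. The search backend, the relevance test, and the keyword extractor are simplified stand-ins, not the interfaces of the cited system [31], which queries Bing and uses a trained domain classifier.

```python
from collections import Counter
import re


def search(query, backend):
    """Stand-in search call: 'backend' maps a query string to (url, text) hits."""
    return backend.get(query, [])


def is_relevant(page_text, domain_terms):
    """Toy relevance test: a page counts as relevant if it mentions at least
    two domain terms. The cited system uses a trained domain classifier."""
    text = page_text.lower()
    return sum(term in text for term in domain_terms) >= 2


def extract_keywords(page_text, top_k=3):
    """Naive keyword extraction by term frequency, ignoring a few stop words."""
    stop = {"the", "and", "for", "with", "from", "that", "this"}
    words = re.findall(r"[a-z]{3,}", page_text.lower())
    counts = Counter(w for w in words if w not in stop)
    return [w for w, _ in counts.most_common(top_k)]


def find_seeds(initial_query, domain_terms, backend, max_rounds=3):
    """Relevance-feedback loop: every relevant hit becomes a candidate seed,
    and its top keywords are recombined into the next round of queries."""
    seeds, queries, issued = set(), [initial_query], set()
    for _ in range(max_rounds):
        next_queries = []
        for query in queries:
            if query in issued:
                continue
            issued.add(query)
            for url, text in search(query, backend):
                if is_relevant(text, domain_terms):
                    seeds.add(url)
                    next_queries.append(" ".join(extract_keywords(text)))
        queries = next_queries
    return seeds
```

Each relevant page thus plays a double role: its URL becomes a candidate seed, and its keywords drive the next round of queries.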
“…Several focused crawling and discovery techniques could potentially be adapted for this problem. However, they all rely on the availability of an accurate domain-specific classifier [3,13,31]. This is an unrealistic assumption for the many application scenarios where experts must start with a small set of relevant websites, since a small sample is unlikely to be sufficient to construct an accurate classifier.…”
Section: Introduction (mentioning)
confidence: 99%
“…Rather, most existing unsupervised approaches [2,8,9,29,30,34,35] can be applied either over a collection of result pages or over a collection of detail pages. Other approaches rely on the same publishing pattern but focus only on segmenting the result pages [28], or rely on the much weaker signals that arise from aligning the labels of the search-form fields directly against the labels of the data on the detail pages [37]; finally, several approaches focus on the problem of finding redundancy [3,38] among several sites, but that problem quickly shades into the problem of integrating data coming from autonomous sources [3,5,6,23,38], which is well known to have no simple solution [14].…”
Section: Introduction and Overview (mentioning)
confidence: 99%
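The "much weaker signals" mentioned in the statement above come from matching the labels of a site's search-form fields against the field labels shown on its detail pages. The sketch below approximates that idea with plain string similarity; the label sets and the threshold are illustrative assumptions, not the method of [37].

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Normalized edit-based similarity between two labels."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def align_labels(form_labels, detail_labels, threshold=0.7):
    """Pair each search-form label with its most similar detail-page label,
    keeping only pairs whose similarity clears the threshold."""
    alignment = {}
    for form_label in form_labels:
        best, best_score = None, 0.0
        for detail_label in detail_labels:
            score = similarity(form_label, detail_label)
            if score > best_score:
                best, best_score = detail_label, score
        if best is not None and best_score >= threshold:
            alignment[form_label] = best
    return alignment


# Hypothetical labels from a used-car site's search form and detail pages.
print(align_labels(["Make", "Model", "Max price"],
                   ["Manufacturer", "Model", "Price", "Mileage"]))
```

In this toy run, "Model" and "Max price" find counterparts ("Model" and "Price"), while "Make" versus "Manufacturer" falls below the threshold, which is exactly why such label alignment is considered a weak signal.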
“…The following related problems have already been tackled in the literature and are beyond the scope of the present paper: finding deep Web sources [36,39,40,43]; filling the search fields with meaningful values to collect result pages [1,4,17,25,31-33,42]; crawling paginated search result pages [21].…”
Section: Introduction and Overview (mentioning)
confidence: 99%
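Of the related problems listed above, crawling paginated search result pages is the most mechanical, and a minimal sketch may help fix intuition: fetch a result page, store it, and follow its "Next" link until none remains. The start URL, the literal link text "Next", and the page limit are assumptions for illustration, not the technique of [21].

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def crawl_result_pages(start_url, max_pages=50):
    """Collect the HTML of successive result pages by following 'Next' links."""
    pages, url = [], start_url
    while url and len(pages) < max_pages:
        html = requests.get(url, timeout=10).text
        pages.append(html)
        next_link = BeautifulSoup(html, "html.parser").find("a", string="Next")
        url = (urljoin(url, next_link["href"])
               if next_link and next_link.has_attr("href") else None)
    return pages
```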
“…To direct the crawl towards topically relevant websites (i.e., websites with content relevant to cultural heritage), we use an SVM classifier, which is trained on an equal number of positive and negative example websites provided as input to the model builder component of ACHE. Subsequently, the seed finder [79] component is used to locate initial seeds for the focused crawl on the clear web by combining the pre-built classification model with the topic-related, user-provided query discussed above. Since the crawled websites may be anything from blog posts to organizational web pages and do not have a predetermined structure (unlike the social media pages), the collected content is only parsed to remove HTML markup and is stored as raw text in the Hydria data lake.…”
(mentioning)
confidence: 99%
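A rough sketch of the classification and storage steps described above, under two stated assumptions: the SVM is trained here with scikit-learn rather than through ACHE's own model builder, and the tiny balanced training set and the data-lake list are placeholders rather than the actual Hydria components.

```python
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder training data: an equal number of positive (cultural heritage)
# and negative example texts, standing in for the harvested example websites.
train_texts = [
    "museum exhibition of byzantine icons and frescoes",
    "archaeological excavation reveals roman mosaics and pottery",
    "buy cheap flights and compare hotel deals online",
    "latest smartphone reviews and consumer tech news",
]
train_labels = [1, 1, 0, 0]

# Bag-of-words SVM classifier, analogous in role to the ACHE topic model.
classifier = make_pipeline(TfidfVectorizer(), LinearSVC())
classifier.fit(train_texts, train_labels)


def store_if_relevant(html, data_lake):
    """Strip HTML markup, classify the raw text, and keep it if on-topic."""
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)
    if classifier.predict([text])[0] == 1:
        data_lake.append(text)


data_lake = []  # stands in for the Hydria data lake
store_if_relevant("<html><body>Medieval frescoes museum exhibition guide"
                  "</body></html>", data_lake)
print(data_lake)
```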