Lecture Notes in Computer Science
DOI: 10.1007/978-3-540-74477-1_31

Crawling the Content Hidden Behind Web Forms

Abstract: The crawler engines of today cannot reach most of the information contained in the Web. A great amount of valuable information is "hidden" behind the query forms of online databases, and/or is dynamically generated by technologies such as JavaScript. This portion of the Web is usually known as the Deep Web or the Hidden Web. We have built DeepBot, a prototype hidden-web crawler able to access such content. DeepBot receives as input a set of domain definitions, each one describing a specific data-colle…

Citations: cited by 27 publications (34 citation statements)
References: 10 publications
“…As far as we know, RoadRunner is the only automatic web data extraction system available for download. Compared to our system, RoadRunner performs neither the region location stage nor the record division stage. Its function is comparable to the stage in our approach which extracts the individual attributes from each data record.…”
Section: Comparison With Roadrunner (mentioning)
confidence: 98%
“…3. We define the column similarity between t_i and t_j, denoted cs(t_i, t_j), as the inverse of the average absolute error between the columns corresponding to t_i and t_j in the similarity matrix (2). Therefore, to consider two subtrees as similar, the column similarity measure requires their columns in the similarity matrix to be very similar.…”
Section: Grouping the Subtrees (mentioning)
confidence: 99%
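
The column similarity described in the statement above can be illustrated with a minimal sketch. It assumes the similarity matrix is a square list of lists whose entry [r][c] holds the similarity between subtrees r and c; the function name `column_similarity`, the sample values, and the choice to return infinity for identical columns are assumptions made for illustration, not details taken from the cited paper.

```python
import math


def column_similarity(sim, i, j):
    """Inverse of the average absolute error between columns i and j.

    `sim` is assumed to be a square similarity matrix (list of lists).
    Identical columns have zero error, so infinity is returned in that
    case (an assumption; the cited paper may handle it differently).
    """
    n = len(sim)
    mean_abs_error = sum(abs(row[i] - row[j]) for row in sim) / n
    if mean_abs_error == 0:
        return math.inf
    return 1.0 / mean_abs_error


# Hypothetical similarity matrix for three subtrees.
sim = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]

cs_01 = column_similarity(sim, 0, 1)  # columns 0 and 1 are close
cs_02 = column_similarity(sim, 0, 2)  # columns 0 and 2 differ a lot
```

With these sample values, subtrees 0 and 1 receive a much higher column similarity than subtrees 0 and 2, which matches the requirement that similar subtrees have very similar columns.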
“…No page down is allowed. [footnote 2: http://www.amazon.com/Best-Sellers/zgbs] may prefer a small k to (1) speed up query processing and shorten the returned webpage, and/or (2) thwart web/tuple scraping. However, in order to accommodate the needs of website users, the value of k should not be too small.…”
Section: Introduction a Problem Motivation (mentioning)
confidence: 99%
“…Several papers have dealt with the problem of crawling and downloading information present in hidden text based databases [10]-[12]. [13]-[15] deal with extracting data from structured hidden databases. [16] and [17] use query based sampling methods to generate content summaries with relative and absolute frequencies while [18], [19] uses two phase sampling method on text based interfaces.…”
Section: Analysis Of Alert-hybrid (mentioning)
confidence: 99%