2004
DOI: 10.1145/1017460.1017462
Automatic information extraction from large websites

Abstract: Information extraction from websites is nowadays a relevant problem, usually performed by software modules called wrappers. A key requirement is that the wrapper generation process should be automated to the largest extent, in order to allow for large-scale extraction tasks even in the presence of changes in the underlying sites. So far, however, only semi-automatic proposals have appeared in the literature. We present a novel approach to information extraction from websites, which reconciles recent proposals for s…
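The core intuition behind automatic wrapper generation in this line of work is that pages produced by the same server-side template share a fixed structure, so comparing sample pages reveals which tokens are template and which are data. A toy sketch of that comparison step, using token-level diffing (the page strings and field values below are invented for illustration, not taken from the paper):

```python
import difflib

def infer_fields(page_a: str, page_b: str) -> list:
    """Compare two pages generated from the same template and return
    the pairs of differing token runs -- i.e. the data fields, as
    opposed to the shared template markup."""
    tokens_a = page_a.split()
    tokens_b = page_b.split()
    matcher = difflib.SequenceMatcher(a=tokens_a, b=tokens_b)
    fields = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":  # mismatching runs are data, not template
            fields.append((" ".join(tokens_a[i1:i2]),
                           " ".join(tokens_b[j1:j2])))
    return fields

# Two hypothetical pages from one template:
a = "<html><b>Title:</b> DatabaseSystems <b>Price:</b> 40 </html>"
b = "<html><b>Title:</b> LogicProgramming <b>Price:</b> 27 </html>"
print(infer_fields(a, b))
# → [('DatabaseSystems', 'LogicProgramming'), ('40', '27')]
```

A real system must of course handle optional and repeated fields (lists of records), which is where the grammar-inference machinery of the full approach comes in; this sketch only shows the template/data separation on flat pages.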

Cited by 146 publications (127 citation statements)
References 29 publications
“…In fact, wrappers of various sorts have been around as long as the web itself, and continue today, especially in the context of the deep web [AK97, Ku98, CM04]. This is partly an essential bootstrapping exercise: unless semantic content is sufficiently universal, then users will not rely on it, and if users do not expect it providers will not supply it; external meta-data and inference at the time of use can effectively transform the human web to semantic form and break the impasse.…”
Section: Meta-information On Human Web Sources
confidence: 99%
“…We can also mention WebL [19], RoadRunner [8], JEDI [18], the Garlic project (http://www.almaden.ibm.com/cs/garlic/adagency.html), NoDoSE [1], the University of Maryland Wrapper Generation Project [11], TSIMMIS [12] or LAPIS [21].…”
Section: Tools
confidence: 99%
“…Here we applied screen scraping [12,21] techniques to fetch produce code, popular name, scientific name and description from the Brazilian Ministry of Agriculture Web portal. See Section 4 for details on these techniques.…”
Section: Data Acquisition
confidence: 99%