2009
DOI: 10.14778/1687553.1687580

Scalable web data extraction for online market intelligence

Abstract: Online market intelligence (OMI), in particular competitive intelligence for product pricing, is a very important application area for Web data extraction. However, OMI presents non-trivial challenges to data extraction technology. Sophisticated and highly parameterized navigation and extraction tasks are required. On-the-fly data cleansing is necessary in order to identify identical products from different suppliers. It must be possible to smoothly define data flow scenarios that merge and filter streams of …

Cited by 40 publications (36 citation statements)
References: 14 publications
“…Our approach can be categorized as wrapper induction: generation of extraction rules derived from a given set of training samples. This is fundamentally different from approaches where wrappers have to be defined by the user, even if this user is assisted by elaborate tools such as Lixto [3] or XWrap [4], because a wrapper may fail if the website changes its layout. Wrapper induction enables automatic adaptation to a new layout or even extracting from a previously unseen layout without human intervention.…”
Section: Related Research (mentioning)
confidence: 99%
“…While this XPath wrapper might work for one specific detail page, it is very inflexible, i.e., it does not generalize well to other detail pages. The other extreme is an XPath like //a[3] which selects the third anchor of a detail page. This XPath likely has a match on every detail page, but it is unlikely that it will select the same part of the template in every detail page.…”
Section: XPath Generation (mentioning)
confidence: 99%
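The trade-off described in this citation statement can be illustrated with a small sketch. It is not taken from the citing paper: the two pages, the `SPECIFIC` and `GENERIC` paths, and the `select_text` helper are all hypothetical, and well-formed XML stands in for real HTML so that Python's standard `xml.etree.ElementTree` (which supports only a subset of XPath) can evaluate the paths.

```python
import xml.etree.ElementTree as ET

# Two hypothetical detail pages built from a similar template (assumed data).
PAGE_A = """<html><body>
<div><a href="/home">Home</a><a href="/cat">Category</a><a href="/p/1">Product 1</a></div>
<div><span class="price">9.99</span></div>
</body></html>"""

# Page B has one extra anchor and one extra banner div.
PAGE_B = """<html><body>
<div><a href="/home">Home</a><a href="/cat">Category</a><a href="/sale">Sale</a><a href="/p/2">Product 2</a></div>
<div>Limited offer</div>
<div><span class="price">19.99</span></div>
</body></html>"""

# A fully positional path: precise on the page it was written for, brittle elsewhere.
SPECIFIC = "./body/div[2]/span"
# A very general path: an <a> that is the third <a> among its siblings.
GENERIC = ".//a[3]"


def select_text(page: str, path: str) -> list:
    """Return the text of every element matched by `path` in `page`."""
    root = ET.fromstring(page)
    return [element.text for element in root.findall(path)]


if __name__ == "__main__":
    for name, page in (("A", PAGE_A), ("B", PAGE_B)):
        print(name, "specific:", select_text(page, SPECIFIC))
        print(name, "generic: ", select_text(page, GENERIC))
```

In this toy setup the positional path finds the price on page A but nothing on page B, while the general path matches on both pages yet lands on a product link on one and a promotional link on the other, which is the behaviour the quoted passage warns about.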
“…Ferrara et al [9] make a survey about the techniques for extracting information from specific sources, concluding that these techniques are domain-dependent and specifically designed for the existing problem and the information source type. Baumgartner et al [10] present a solution for the online market intelligence problem based on a novel web data extraction technology. Sánchez Torres and Palop [11] find a common feature in these technologies: the information is generally presented in reports that are disseminated and, then, considered as the input of knowledge creation process.…”
Section: Intelligent Organizations (mentioning)
confidence: 99%
“…The problem addressed here is related to the one of designing proper wrappers to load contents of Web pages, such as Lixto, that has been developed for extracting product pricing from already known Web sources [3]. Nevertheless, Web API documentation is contained in rather heterogeneous (in format and content) and unfamiliar HTML files, thus hampering the task of discovering useful data, i.e.…”
Section: A Web API Model Extractor (mentioning)
confidence: 99%