2013
DOI: 10.14778/2536206.2536209

Extraction and integration of partially overlapping web sources

Abstract: We present an unsupervised approach for harvesting the data exposed by a set of structured and partially overlapping data-intensive web sources. Our proposal comes within a formal framework tackling two problems: the data extraction problem, to generate extraction rules based on the input websites, and the data integration problem, to integrate the extracted data in a unified schema. We introduce an original algorithm, WEIR, to solve the stated problems and formally prove its correctness. WEIR leverages the ov…

Cited by 45 publications (50 citation statements)
References: 23 publications

“…An interesting solution to achieve high scalability in extraction and integration is by exploiting the redundancy of published information of multiple web sources [2], or engaging humans to improve the performances [3]. In [2] the authors observe that web sources that publish information about the same domain often show a redundancy at the schema level and a partial overlap at instance level.…”
Section: Related Work and Open Issues (mentioning)
confidence: 99%
“…In [2] the authors observe that web sources that publish information about the same domain often show a redundancy at the schema level and a partial overlap at instance level. Aligning instances from different sources provides an automatic technique to address synergically both extraction and integration of data from web sources.…”
Section: Related Work and Open Issues (mentioning)
confidence: 99%
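
The observation quoted above, that sources in the same domain are redundant at the schema level and partially overlap at the instance level, can be illustrated with a small sketch. This is not the WEIR algorithm. It is a toy illustration that assumes instances are already aligned by a shared key and that matching attributes publish identical strings. The names `agreement` and `match_attributes`, the sample sites, and the 0.8 threshold are all made up for the example.

```python
from itertools import product


def agreement(source_a, source_b, attr_a, attr_b):
    """Fraction of shared instances on which attr_a and attr_b hold equal values."""
    # Assumption: both sources identify an instance by the same key.
    shared = source_a.keys() & source_b.keys()
    comparable = [k for k in shared
                  if attr_a in source_a[k] and attr_b in source_b[k]]
    if not comparable:
        return 0.0
    hits = sum(source_a[k][attr_a] == source_b[k][attr_b] for k in comparable)
    return hits / len(comparable)


def match_attributes(source_a, source_b, threshold=0.8):
    """Pair up attributes whose values agree on the overlapping instances."""
    attrs_a = sorted({a for rec in source_a.values() for a in rec})
    attrs_b = sorted({b for rec in source_b.values() for b in rec})
    matches = []
    for a, b in product(attrs_a, attrs_b):
        score = agreement(source_a, source_b, a, b)
        if score >= threshold:  # hypothetical cut-off, not taken from the paper
            matches.append((a, b, score))
    return matches


# Made-up data: two sites that overlap on two of their three instances.
site_a = {
    "AAPL": {"price": "101.5", "vol": "1200"},
    "MSFT": {"price": "45.2", "vol": "800"},
    "IBM": {"price": "160.0", "vol": "300"},
}
site_b = {
    "AAPL": {"last": "101.5", "volume": "1200"},
    "MSFT": {"last": "45.2", "volume": "800"},
    "GOOG": {"last": "530.1", "volume": "500"},
}

matches = match_attributes(site_a, site_b)
```

Only the instances present in both sites contribute to the score, so the partial overlap is what makes the attribute labels (`price` and `last`, `vol` and `volume`) comparable even though they differ. Real pages would also need the extraction step and tolerance for noisy or differently formatted values, which this sketch leaves out.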