Extraction and integration of partially overlapping web sources

Bronzi, Mirko; Crescenzi, Valter; Merialdo, Paolo; Papotti, Paolo

doi:10.14778/2536206.2536209

Cited by 45 publications

(50 citation statements)

References 23 publications

“…An interesting solution to achieve high scalability in extraction and integration is by exploiting the redundancy of published information of multiple web sources [2], or engaging humans to improve the performances [3]. In [2] the authors observe that web sources that publish information about the same domain often show a redundancy at the schema level and a partial overlap at instance level.…”

Section: Related Work and Open Issues (mentioning)

confidence: 99%

“…In [2] the authors observe that web sources that publish information about the same domain often show a redundancy at the schema level and a partial overlap at instance level. Aligning instances from different sources provides an automatic technique to address synergically both extraction and integration of data from web sources.…”

Section: Related Work and Open Issues (mentioning)

confidence: 99%

“…We envision a framework that combines automatic data extraction and integration techniques [2,8] with a supervised approach [5,6] guided by the crowd.…”

Section: Architecture (mentioning)

confidence: 99%

“…Different techniques can be adopted to reduce the number of generated rules and to discard rules that extract non relevant template nodes [2].…”

Section: Extracting and Matching (mentioning)

confidence: 99%

“…Another attempt to scale the extraction and integration of web data is by relying on partially overlapping web sources [2], web sources that push redundant information at schema level and at instance level. They adopt a "lazy" approach so that the schema of multiple web sources of the same domain is learned during the extraction process, in a bottom up fashion.…”

Section: Introduction (mentioning)

confidence: 99%

Extraction and integration of web sources with humans and domain knowledge

Qiu

Luce

2014

Proceedings of the 23rd International Conference on World Wide Web

The extraction and integration of data from many web sources in different domains is an open issue. Two promising families of solutions take on this challenge: top-down approaches rely on domain knowledge that is manually crafted by an expert to guide the process, and bottom-up approaches try to infer the schema from many web sources to make sense of the extracted data. The former scale over the number of web sources, but for settings with different domains, an expert has to manually craft an ontology for each domain. The latter do not require a domain expert, but high quality is achieved only with extensive human interaction in both the extraction and integration steps. We introduce a framework that takes the best from both approaches. The framework addresses extraction and integration of data from web sources jointly. No domain expert is required: the framework exploits data from a seed knowledge base to enhance the automatic extraction and integration (top down). Human workers from crowdsourcing platforms are engaged to improve the quality and the coverage of the extracted data. The framework adopts techniques to automatically extract both the schema and the data from multiple web sources (bottom up). The extracted information is then used to bootstrap the seed knowledge base, thereby reducing the human effort for future tasks.
