Wrapper induction: Efficiency and expressiveness

Kushmerick, Nicholas

doi:10.1016/s0004-3702(99)00100-9

Cited by 424 publications (292 citation statements)

References 14 publications

Supporting

Mentioning

288

Contrasting

Unclassified

“…There are hopes that XML will solve this problem, but XML is not yet in widespread use and even in the best case it will only address the problem within application domains where the interested parties can agree on the XML schema definitions. Previous work on wrapper generation in both academic research [4,6,8] and commercial products (such as OnDisplay's eContent) have primarily focused on the ability to rapidly create wrappers. The previous work makes no attempt to ensure the accuracy of the wrappers over the entire set of pages of a site and provides no capability to detect failures and repair the wrappers when the underlying sources change.…”

Section: Introduction (mentioning)

confidence: 99%

Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach

Knoblock

Lerman

Minton

et al. 2003

Studies in Fuzziness and Soft Computing

“…Instead, it uses web feeds as a model that informs the process of generating extraction rules and it therefore resembles the Modelling-Based approaches. Hence, the approach presented in this paper can be positioned in relation to tools such as WIEN [9], Stalker [12], RoadRunner [4] or NoDoSE [1].…”

Section: Discussion and Related Work (mentioning)

confidence: 99%

“…The term wrapper induction is, in fact, coined by the authors [9] of the tool. However, as one of the earlier attempts, the use of the tool is restricted to a specific structure of the page and the heuristics of the presented data.…”

Section: Discussion and Related Work (mentioning)

confidence: 99%

Self-supervised Automated Wrapper Generation for Weblog Data Extraction

et al. 2013

Abstract. Data extraction from the web is notoriously hard. Of the types of resources available on the web, weblogs are becoming increasingly important due to the continued growth of the blogosphere, but remain poorly explored. Past approaches to data extraction from weblogs have often involved manual intervention and suffer from low scalability. This paper proposes a fully automated information extraction methodology based on the use of web feeds and processing of HTML. The approach includes a model for generating a wrapper that exploits web feeds for deriving a set of extraction rules automatically. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach for deriving a set of rules and automating the process of wrapper generation. An evaluation of the model is conducted on a dataset of 2,393 posts and the results (92% accuracy) show that the proposed technique enables robust extraction of weblog properties and can be applied across the blogosphere for applications such as improved information retrieval and more robust web preservation initiatives.

“…Various techniques are proposed in the literature for Web data extraction: declarative languages [9], [2], wrapper induction [10], [16], deduction from ontologies [21].…”

Section: Related Work (mentioning)

confidence: 99%