Proceedings of the 14th International Conference on World Wide Web - WWW '05 2005
DOI: 10.1145/1060745.1060761

Web data extraction based on partial tree alignment

Abstract: This paper studies the problem of extracting data from a Web page that contains several structured data records. The objective is to segment these data records, extract data items/fields from them and put the data in a database table. This problem has been studied by several researchers. However, existing methods still have some serious limitations. The first class of methods is based on machine learning, which requires human labeling of many examples from each Web site that one is interested in extracting dat…

Cited by 396 publications (316 citation statements)
References 35 publications
“…Instead of using nested tags (which have many errors) in the HTML code to build a tag tree, we build a tag tree based on the nested rectangles (see [13] for more details). bedded parsing and rendering engine of a browser, e.g., Internet explorer.…”
Section: Building the Tag Tree (mentioning)
confidence: 99%
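The statement above describes building a tag tree from nested rendered rectangles rather than from nested HTML tags. The following is a minimal sketch of that idea, not the cited authors' implementation. It assumes each element comes with a bounding box `(left, top, right, bottom)` and that elements arrive in document order; `Node`, `contains`, and `build_tag_tree` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed box layout: (left, top, right, bottom) in rendered page coordinates.
Box = Tuple[float, float, float, float]


@dataclass
class Node:
    tag: str
    box: Box
    children: List["Node"] = field(default_factory=list)


def contains(outer: Box, inner: Box) -> bool:
    """Return True if `inner` lies entirely within `outer` (edges may touch)."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def build_tag_tree(elements: List[Tuple[str, Box]]) -> Node:
    """Build a tree from (tag, box) pairs given in document order.

    An element becomes a child of the most recent earlier element whose
    rectangle contains it, so nesting comes from geometry, not from tags.
    """
    inf = float("inf")
    root = Node("#root", (-inf, -inf, inf, inf))  # sentinel that contains everything
    stack = [root]
    for tag, box in elements:
        # Pop until the top of the stack visually contains the new element.
        while not contains(stack[-1].box, box):
            stack.pop()
        node = Node(tag, box)
        stack[-1].children.append(node)
        stack.append(node)
    return root


# Usage: one table with two rows; the first row holds two cells.
elements = [
    ("table", (0, 0, 100, 100)),
    ("tr", (0, 0, 100, 50)),
    ("td", (0, 0, 50, 50)),
    ("td", (50, 0, 100, 50)),
    ("tr", (0, 50, 100, 100)),
]
tree = build_tag_tree(elements)
```

Because parent-child links depend only on rectangle containment, missing or misnested closing tags in the HTML source do not affect the resulting tree in this sketch.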
“…Note also that a prototype may consist of multiple child nodes (not just one as N4) as a single data record may consist of multiple child nodes. To produce the prototype, in general we need to perform multiple alignments of data records [13]. However, we use a simpler method based on the extracted data (see below).…”
Section: Align Matched Data Items: Alignandlink() (mentioning)
confidence: 99%
“…EXALG [1] uses equivalence classes (sets of items that occur with the same frequency in sibling pages) and differentiating roles to generate extraction templates for the sibling pages. DEPTA [18] compares different records in a page instead of sibling pages and tries to find the extraction template for the record. Our system fundamentally differs from these approaches.…”
Section: Introduction (mentioning)
confidence: 99%