Web data extraction based on partial tree alignment

Zhai, Yanhong; Liu, Bing

doi:10.1145/1060745.1060761

Cited by 396 publications

(316 citation statements)

References 35 publications

Supporting

Mentioning

308

Contrasting

Unclassified

“…Instead of using nested tags (which have many errors) in the HTML code to build a tag tree, we build a tag tree based on the nested rectangles (see [13] for more details). […] bedded parsing and rendering engine of a browser, e.g., Internet Explorer.…”

Section: Building the Tag Tree (mentioning)

confidence: 99%

“…Note also that a prototype may consist of multiple child nodes (not just one as N4) as a single data record may consist of multiple child nodes. To produce the prototype, in general we need to perform multiple alignments of data records [13]. However, we use a simpler method based on the extracted data (see below).…”

Section: Align Matched Data Items: Alignandlink() (mentioning)

confidence: 99%

“…We compare it with the most recent system DEPTA [13], which does not find nested data records. We show that for flat data records, NET performs as well as DEPTA.…”

Section: Empirical Evaluation (mentioning)

confidence: 99%

“…[7] and [15] propose some algorithms to identify data records, which do not extract data items from the data records, and do not handle nested data records. Our previous system DEPTA [13] is able to align and extract data items from data records, but does not handle nested data records. This paper proposes a more effective method to extract data from Web pages that contain a set of flat or nested data records automatically.…”

Section: Introduction (mentioning)

confidence: 99%

NET – A System for Extracting Web Data from Flat and Nested Data Records

Liu

Zhai

2005

Lecture Notes in Computer Science

Self Cite

Abstract. This paper studies automatic extraction of structured data from Web pages. Each of such pages may contain several groups of structured data records. Existing automatic methods still have several limitations. In this paper, we propose a more effective method for the task. Given a page, our method first builds a tag tree based on visual information. It then performs a post-order traversal of the tree and matches subtrees in the process using a tree edit distance method and visual cues. After the process ends, data records are found and data items in them are aligned and extracted. The method can extract data from both flat and nested data records. Experimental evaluation shows that the method performs the extraction task accurately.
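The traversal-and-matching step described in the abstract can be illustrated with a minimal sketch. It assumes "simple tree matching" as the restricted tree edit distance, a hypothetical `Node` type, a normalized similarity score, and a threshold of 0.7; none of these details are stated in the abstract, and visual cues are left out entirely.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tag-tree node: a tag name plus ordered children."""
    tag: str
    children: list[Node] = field(default_factory=list)


def simple_tree_match(a: Node, b: Node) -> int:
    """Number of node pairs in the largest order-preserving top-down mapping."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # table[i][j]: best match count using the first i children of a and j of b
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            paired = table[i - 1][j - 1] + simple_tree_match(
                a.children[i - 1], b.children[j - 1])
            table[i][j] = max(table[i - 1][j], table[i][j - 1], paired)
    return table[m][n] + 1  # +1 for the matched roots


def size(t: Node) -> int:
    return 1 + sum(size(c) for c in t.children)


def similarity(a: Node, b: Node) -> float:
    """Assumed normalization: matched pairs relative to average tree size."""
    return 2 * simple_tree_match(a, b) / (size(a) + size(b))


def similar_sibling_pairs(root: Node, threshold: float = 0.7):
    """Post-order traversal: handle children first, then compare adjacent
    child subtrees of the current node."""
    found = []
    for child in root.children:
        found.extend(similar_sibling_pairs(child, threshold))
    for left, right in zip(root.children, root.children[1:]):
        if similarity(left, right) >= threshold:
            found.append((left, right))
    return found
```

Adjacent sibling subtrees that score above the threshold are candidates for data records; a real system would go on to align the items inside them.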

“…EXALG [1] uses equivalence classes (sets of items that occur with the same frequency in sibling pages) and differentiating roles to generate extraction templates for the sibling pages. DEPTA [18] compares different records in a page instead of sibling pages and tries to find the extraction template for the record. Our system fundamentally differs from these approaches.…”

Section: Introduction (mentioning)

confidence: 99%

Automatic Hidden-Web Table Interpretation by Sibling Page Comparison

Tao

Embley

2007

Conceptual Modeling - ER 2007

Abstract. The longstanding problem of automatic table interpretation still eludes us. Its solution would not only be an aid to table processing applications such as large volume table conversion, but would also be an aid in solving related problems such as information extraction and semi-structured data management. In this paper, we offer a conceptual modeling solution for the common special case in which so-called sibling pages are available. The sibling pages we consider are pages on the hidden web, commonly generated from underlying databases. We compare them to identify and connect nonvarying components (category labels) and varying components (data values). We tested our solution using more than 2,000 tables in source pages from three different domains: car advertisements, molecular biology, and geopolitical information. Experimental results show that the system can successfully identify sibling tables, generate structure patterns, interpret tables using the generated patterns, and automatically adjust the structure patterns, if necessary, as it processes a sequence of hidden-web pages. For these activities, the system was able to achieve an overall F-measure of 94.5%.
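The core idea of comparing sibling pages, separating nonvarying components from varying ones, can be shown with a toy sketch. It assumes two tables of identical shape, each given as rows of cell strings; the function name and the sample data are hypothetical, and the paper's actual structure patterns are not reproduced here.

```python
def split_labels_and_values(table_a, table_b):
    """Compare two same-shaped sibling tables cell by cell.

    Cells with identical text are treated as category labels,
    cells whose text differs are treated as data values.
    """
    labels, values = {}, {}
    for r, (row_a, row_b) in enumerate(zip(table_a, table_b)):
        for c, (cell_a, cell_b) in enumerate(zip(row_a, row_b)):
            if cell_a == cell_b:
                labels[(r, c)] = cell_a
            else:
                values[(r, c)] = (cell_a, cell_b)
    return labels, values


# Hypothetical sibling tables from two car-advertisement pages.
car_1 = [["Make", "Honda"], ["Year", "2004"]]
car_2 = [["Make", "Ford"], ["Year", "1999"]]
labels, values = split_labels_and_values(car_1, car_2)
```

A data value that happens to be identical on both pages would be misread as a label here, which is one reason a real system would compare more than two pages and adjust its patterns over time.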

Visual webpage block importance prediction using conditional random fields

Tsai

Chiu

2011

J. Am. Soc. Inf. Sci.

We have developed a system that segments web pages into blocks and predicts those blocks' importance (block importance prediction or BIP). First, we use VIPS to partition a page into a tree composed of blocks and then extract features from each block and label all leaf nodes. This paper makes two main contributions. Firstly, we are pioneering the formulation of BIP as a sequence tagging task. We employ DFS, which outputs a single sequence for the whole tree in which related sub-blocks are adjacent. Our second contribution is using the conditional random fields (CRF) model for labeling these sequences. CRF's transition features model correlations between neighboring labels well, and CRF can simultaneously label all blocks in a sequence to find the global optimal solution for the whole sequence, not only the best solution for each block. In our experiments, our CRF-based system achieves an F1-measure of 97.41%, which significantly outperforms our ME-based baseline (95.64%). Lastly, we tested the CRF-based system using sites which were not covered in the training data. On completely novel sites CRF performed slightly worse than ME. However, when given only two training pages from a given site, CRF improved almost three times as much as ME.
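The DFS flattening step can be sketched as follows. The `Block` type and block names are hypothetical, and the sketch assumes the sequence contains only leaf blocks; the abstract does not say whether inner blocks are also emitted. The CRF labeling itself is not shown.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Block:
    """Hypothetical page block in a segmentation tree."""
    name: str
    children: list[Block] = field(default_factory=list)


def leaf_sequence(block: Block) -> list[str]:
    """Depth-first traversal that flattens the block tree into one sequence.

    Leaves under the same parent end up next to each other, which is what
    lets a sequence model see related sub-blocks as neighbors.
    """
    if not block.children:
        return [block.name]
    sequence = []
    for child in block.children:
        sequence.extend(leaf_sequence(child))
    return sequence


page = Block("page", [
    Block("header", [Block("logo"), Block("nav")]),
    Block("body", [Block("article")]),
])
sequence = leaf_sequence(page)
```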
