2005
DOI: 10.1007/11581062_39

NET – A System for Extracting Web Data from Flat and Nested Data Records

Abstract: This paper studies automatic extraction of structured data from Web pages. Each such page may contain several groups of structured data records. Existing automatic methods still have several limitations. In this paper, we propose a more effective method for the task. Given a page, our method first builds a tag tree based on visual information. It then performs a post-order traversal of the tree and matches subtrees in the process using a tree edit distance method and visual cues. After the proces…
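
The traversal-and-matching step described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the `Node` class, the `threshold` value, and all function names are hypothetical, the tag tree is built by hand rather than from visual information, visual cues are omitted entirely, and a simple restricted tree matching score stands in for the tree edit distance method the abstract mentions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """A tag-tree node (hypothetical structure, for illustration only)."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def size(node: Node) -> int:
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(size(child) for child in node.children)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Count the nodes in a maximum order-preserving match of two subtrees.

    Roots must carry the same tag; children are then aligned with a
    dynamic program, similar to a longest-common-subsequence table.
    """
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            paired = table[i - 1][j - 1] + simple_tree_matching(
                a.children[i - 1], b.children[j - 1]
            )
            table[i][j] = max(table[i - 1][j], table[i][j - 1], paired)
    return table[m][n] + 1  # +1 for the matched roots


def similarity(a: Node, b: Node) -> float:
    """Normalise the match count to the range [0, 1]."""
    return 2 * simple_tree_matching(a, b) / (size(a) + size(b))


def find_similar_siblings(root: Node, threshold: float = 0.8) -> List[Tuple[str, int]]:
    """Post-order traversal: visit children first, then compare adjacent
    child subtrees of the current node.

    Returns (parent tag, index of the left sibling) for every adjacent
    pair whose similarity reaches the assumed threshold.
    """
    found: List[Tuple[str, int]] = []

    def visit(node: Node) -> None:
        for child in node.children:
            visit(child)
        for i in range(len(node.children) - 1):
            if similarity(node.children[i], node.children[i + 1]) >= threshold:
                found.append((node.tag, i))

    visit(root)
    return found
```

Because children are processed before their parent, nested repetitions would be examined before the flat records that contain them, which is the reason a post-order traversal suits pages that mix flat and nested data records.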

Cited by 58 publications (42 citation statements)
References 19 publications

“…It develops the new technique that correlates HTML pages and produces a wrapper with respect to their similarities and variations. Bing Liu, YanhongZhai [9], it explains the issue of automatic web data retrieval from several structured data records. They also explain how to segment the QRR, extracting the records from the data region and put them in Tabular form.…”
Section: Literature Review (mentioning)
confidence: 99%
“…The case when QRR contains the multi valued attribute, then a few of the data values may not be arranged to other data values. The proposed system does not use the this type of arrangement before the data records are arranged as it is aligned in DeLa [8] and NET [9], it uses it later the data records are arranged. Using this arrangement before the data records are arranged, it makes them unsafe to optional attribute so due to what it makes the tag structure irregular.…”
Section:  Nested Structure Processingmentioning
confidence: 99%
“…This feature helps the parsers of search engines to interact with the web pages' contents more efficiently (Ma et al, 2003). One of the useful techniques is wrappers as specified by Palmieri et al (2004), and Liu and Zhai (2005). Wrappers are responsible for converting HTML documents into semantically meaningful XML files to simplify the operation of extracting data.…”
Section: Related Work (mentioning)
confidence: 99%
“…The suggested method by Park and Barbosa (2007) avoids those weaknesses by using the web data extractor algorithm which depends on clustering and the weighted tree matching metric to extract data. Liu and Zhai (2005) realised the importance of extracting data records that were retrieved from databases and displayed on web pages. They analysed the disadvantage of the approaches that were used for extracting data i.e., wrapper induction and automatic extraction, then they proposed a method called nested data extraction using tree matching and visual cues (NET) for extracting flat or nested data records automatically.…”
Section: Related Work (mentioning)
confidence: 99%
“…These nodes constitute a similar sub-tree and then are divided into different data region, Where each node corresponds to a data record, through the analysis of the DOM structure of the page define some extraction rules for data extraction. Based on MDR, Zhai Y [2], Liu B [3], Simon K [4], Lausen G and other algorithms have been proposed DEPTA, NET, and VIPER algorithm. These algorithms are all based on the analysis of DOM structure to define corresponding rules for extraction, which need to traverse a large number of DOM nodes and cost a lot of time.…”
Section: Introduction (mentioning)
confidence: 99%