Mining data records in Web pages

Liu, Bing; Grossman, Robert L.; Zhai, Yanhong

doi:10.1145/956750.956826

Cited by 319 publications

(209 citation statements)

References 11 publications

Supporting

Mentioning

208

Contrasting

Unclassified


“…They focus on two things those are, Data records recognition from the query page and next is arrange these extracted data in a table. Robert Grossman, YanhongZhai, Bing Liu [4], mainly focused on the data record which contains the large amount of information on the web. Data records also contain the information regarding their host pages for example list of product or services.…”

Section: Literature Review (mentioning)

confidence: 99%

“…In Record Extraction phase, firstly it identifies the data region which contains the number of query result records and then it does the segmentation of records [4]. Record alignment steps properly align the extracted data in a structured manner means it arrange the all the extracted QRR's in a table.…”

Section: System Overview (mentioning)

confidence: 99%

“…In this section first we recognize the data regions in query page which contains numerous data records [4] [7]. Some child sub tree of the same parent node, here node is nothing but the HTML tags which forms data regions which is having data records.…”

Section: Fig 2 Tag Tree Of the Page / Data Region Identification (mentioning)

confidence: 99%


Data Extraction and Alignment by using Combining Tag and Values Similarity

Pathak

Chidrawar

2017

International Journal of Advanced Research in Computer and Comm
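
The citation statements above describe data region identification as finding child subtrees of the same parent node that together hold the data records. A minimal sketch of that idea follows. It is not the procedure of the cited paper: the similarity measure (`difflib.SequenceMatcher` over flattened tag lists), the `0.8` threshold, and the names `TagTreeBuilder`, `tag_string`, `data_regions`, and `SAMPLE` are all assumptions made for illustration, and the parser assumes well-formed HTML.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # tags with no end tag


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TagTreeBuilder(HTMLParser):
    """Builds a tag tree from well-formed HTML (no error recovery)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        if tag == self.current.tag and self.current.parent is not None:
            self.current = self.current.parent  # climb back to the parent


def build_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    return builder.root


def tag_string(node):
    """Pre-order list of tag names in the subtree rooted at node."""
    tags = [node.tag]
    for child in node.children:
        tags.extend(tag_string(child))
    return tags


def data_regions(node, threshold=0.8, min_records=2):
    """Return (parent_tag, first_index, last_index) for each run of
    adjacent children whose tag strings are similar (hypothetical criterion)."""
    regions = []
    kids = node.children
    start = 0
    for i in range(1, len(kids) + 1):
        similar = i < len(kids) and SequenceMatcher(
            None, tag_string(kids[i - 1]), tag_string(kids[i])
        ).ratio() >= threshold
        if not similar:  # the current run ends at child i - 1
            if i - start >= min_records:
                regions.append((node.tag, start, i - 1))
            start = i
    for child in kids:
        regions.extend(data_regions(child, threshold, min_records))
    return regions


# Made-up sample page: three list items with the same tag structure.
SAMPLE = (
    "<div><h1>Results</h1><ul>"
    "<li><a>A</a><span>1</span></li>"
    "<li><a>B</a><span>2</span></li>"
    "<li><a>C</a><span>3</span></li>"
    "</ul><p>footer</p></div>"
)

if __name__ == "__main__":
    print(data_regions(build_tag_tree(SAMPLE)))
```

On the sample page the three `li` subtrees share one parent and one tag structure, so they are reported as a single region under `ul`. Comparing only single adjacent children is a deliberate simplification.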


“…At present, many issues in the field of deep Web data integration, such as interface integration [2] [3] and Web data extraction [4,5], have been widely studied. However, as a necessary step, identifying the duplicate entities(records) from multiple Web databases has not received due attention yet.…”

Section: Introduction (mentioning)

confidence: 99%

A Holistic Solution for Duplicate Entity Identification in Deep Web Data Integration

Liu

Meng

2010

2010 Sixth International Conference on Semantics, Knowledge and Grids


Abstract. The proliferation of deep Web offers users a great opportunity to search high-quality information from Web. As a necessary step in deep Web data integration, the goal of duplicate entity identification is to discover the duplicate records from the integrated Web databases for further applications (e.g. price-comparison services). However, most of existing works address this issue only between two data sources, which are not practical to deep Web data integration systems. That is, one duplicate entity matcher trained over two specific Web databases cannot be applied to other Web databases. In addition, the cost of preparing the training set for n Web databases is times higher than that for two Web databases. In this paper, we propose a holistic solution to address the new challenges posed by deep Web, whose goal is to build one duplicate entity matcher over multiple Web databases. The extensive experiments on two domains show that the proposed solution is highly effective for deep Web data integration.


“…A similar method is proposed in [11]. [7] and [15] propose some algorithms to identify data records, which do not extract data items from the data records, and do not handle nested data records. Our previous system DEPTA [13] is able to align and extract data items from data records, but does not handle nested data records.…”

Section: Introduction (mentioning)

confidence: 99%

NET – A System for Extracting Web Data from Flat and Nested Data Records

Liu

Zhai

2005

Lecture Notes in Computer Science

Self Cite


Abstract. This paper studies automatic extraction of structured data from Web pages. Each of such pages may contain several groups of structured data records. Existing automatic methods still have several limitations. In this paper, we propose a more effective method for the task. Given a page, our method first builds a tag tree based on visual information. It then performs a post-order traversal of the tree and matches subtrees in the process using a tree edit distance method and visual cues. After the process ends, data records are found and data items in them are aligned and extracted. The method can extract data from both flat and nested data records. Experimental evaluation shows that the method performs the extraction task accurately.
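
The abstract above outlines a post-order traversal of the tag tree that matches subtrees along the way. The sketch below illustrates only that control flow, under stated assumptions. Simple tree matching (a restricted, top-down form of tree matching) stands in for the tree edit distance method named in the abstract, visual cues are ignored, and the `Node` class, the `0.7` threshold, and the function names are hypothetical rather than taken from the NET system.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    tag: str
    children: List["Node"] = field(default_factory=list)


def size(node: Node) -> int:
    """Number of nodes in the subtree rooted at node."""
    return 1 + sum(size(child) for child in node.children)


def simple_tree_match(a: Node, b: Node) -> int:
    """Size of the largest top-down, order-preserving matching of a and b."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # best[i][j]: best matching between the first i children of a
    # and the first j children of b (dynamic programming table).
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i][j] = max(
                best[i - 1][j],
                best[i][j - 1],
                best[i - 1][j - 1]
                + simple_tree_match(a.children[i - 1], b.children[j - 1]),
            )
    return best[m][n] + 1  # +1 for the matched roots


def similarity(a: Node, b: Node) -> float:
    """Normalized match score in [0, 1]; 1.0 means identical tag structure."""
    return 2 * simple_tree_match(a, b) / (size(a) + size(b))


def similar_sibling_pairs(root: Node, threshold: float = 0.7) -> List[Tuple[str, int]]:
    """Post-order traversal: children are processed before their parent.
    Returns (parent_tag, i) when children i and i + 1 look alike."""
    found: List[Tuple[str, int]] = []

    def visit(node: Node) -> None:
        for child in node.children:
            visit(child)
        for i in range(len(node.children) - 1):
            if similarity(node.children[i], node.children[i + 1]) >= threshold:
                found.append((node.tag, i))

    visit(root)
    return found


def li(*tags: str) -> Node:
    """Helper for the made-up example page below."""
    return Node("li", [Node(t) for t in tags])


page = Node("body", [Node("ul", [li("a", "span"), li("a", "span"), li("a")]), Node("p")])

if __name__ == "__main__":
    print(similar_sibling_pairs(page))
```

Because the traversal is post-order, candidate matches inside a subtree are collected before its parent is examined, which is the property a nested-record extractor would rely on. Aligning the data items inside matched records is left out of this sketch.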

