Juan Raposo scite author profile

Juan Raposo


5 Publications

140 Citation Statements Received

46 Citation Statements Given


Affiliations

University of A Coruña

Publications


Semi-Automatic Wrapper Generation for Commercial Web Sources

Pan et al. 2002


Abstract: Semi-automatic wrapper generation tools aim to ease the task of building structured views over semi-structured web sources. However, the wrapper generation techniques presented to date cannot properly handle sources that require complex navigational sequences to access data. In this paper, we present WARGO, a semi-automatic wrapper generation tool that has been used by non-programmer staff to successfully wrap more than 700 commercial web sources in several industrial applications. We describe our approach to wrapper generation and show the difficulties other systems face when wrapping this kind of source.

Crawling the Content Hidden Behind Web Forms

et al.


Abstract: Today's crawler engines cannot reach most of the information contained in the Web. A great amount of valuable information is "hidden" behind the query forms of online databases and/or is dynamically generated by technologies such as JavaScript. This portion of the Web is usually known as the Deep Web or the Hidden Web. We have built DeepBot, a prototype hidden-web crawler able to access such content. DeepBot receives as input a set of domain definitions, each one describing a specific data-collecting task, and automatically identifies and learns to execute queries on the forms relevant to them. In this paper we describe the techniques employed for building DeepBot and report the experimental results obtained when testing it with several real-world data collection tasks.

Extracting lists of data records from semi-structured web pages

et al. 2008

Data & Knowledge Engineering


Automated browsing in AJAX websites

et al. 2011

Data & Knowledge Engineering


Finding and Extracting Data Records from Web Pages

et al. 2008

J Sign Process Syst Sign Image Video Technol


Many HTML pages are generated by software programs that query an underlying database and fill in a template with the data. In these situations the meta-information about the data structure is lost, so automated programs cannot process the data as powerfully as they can process information from databases. We propose a set of novel techniques for detecting structured records in a web page and extracting the data values that constitute them. Our method needs only one input page. It starts by identifying the data region of interest in the page. The region is then partitioned into records using a clustering method that groups similar subtrees in the DOM tree of the page. Finally, the attributes of the data records are extracted using a method based on multiple string alignment. We have tested our techniques with a large number of real web sources, obtaining high precision and recall values.
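The record-partitioning step described in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm. It assumes the data region is already known, flattens each child subtree into its sequence of tag names, and groups subtrees greedily using `difflib.SequenceMatcher` as a stand-in similarity measure. The names `tag_sequence`, `cluster_subtrees`, the 0.8 threshold, and the sample page are all hypothetical. Attribute extraction is reduced here to reading fields by position, which is far simpler than the multiple string alignment the abstract mentions.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def tag_sequence(node):
    """Flatten a subtree into the pre-order list of its tag names."""
    return [el.tag for el in node.iter()]


def similarity(a, b):
    """Similarity in [0, 1] between two tag sequences.

    Assumption: difflib's ratio is used for illustration only; the
    paper may rely on a different measure.
    """
    return SequenceMatcher(None, a, b).ratio()


def cluster_subtrees(region, threshold=0.8):
    """Greedily group the children of a data region by structural similarity."""
    clusters = []  # each cluster is a list of child elements
    for child in region:
        seq = tag_sequence(child)
        for cluster in clusters:
            # Compare against the first member of each existing cluster.
            if similarity(seq, tag_sequence(cluster[0])) >= threshold:
                cluster.append(child)
                break
        else:
            clusters.append([child])
    return clusters


# Hypothetical, well-formed sample page standing in for a real data region.
PAGE = """
<div>
  <p>Results</p>
  <div><b>Book A</b><span>10 EUR</span></div>
  <div><b>Book B</b><span>12 EUR</span></div>
  <div><b>Book C</b><span>9 EUR</span></div>
</div>
"""

region = ET.fromstring(PAGE)
clusters = cluster_subtrees(region)

# Treat the largest cluster as the list of data records.
records = max(clusters, key=len)

# Simplified attribute extraction: read each record's fields by position.
rows = [[el.text for el in rec] for rec in records]
```

In this sketch the heading paragraph ends up in its own cluster, while the three structurally identical `div` elements are grouped together and treated as the records.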
