2004
DOI: 10.1145/1017460.1017462
Automatic information extraction from large websites

Abstract: Information extraction from websites is nowadays a relevant problem, usually performed by software modules called wrappers. A key requirement is that the wrapper generation process should be automated to the largest extent, in order to allow for large-scale extraction tasks even in the presence of changes in the underlying sites. So far, however, only semi-automatic proposals have appeared in the literature. We present a novel approach to information extraction from websites, which reconciles recent proposals for s…
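The core intuition behind automatic wrapper generation in this line of work is that pages produced by the same server-side template share a fixed structure, so comparing sample pages reveals which tokens are template and which are data. A toy sketch of that comparison step, using token-level diffing (the page strings and field values below are invented for illustration, not taken from the paper):

```python
import difflib

def infer_fields(page_a: str, page_b: str) -> list:
    """Compare two pages generated from the same template and return
    the pairs of differing token runs -- i.e. the data fields, as
    opposed to the shared template markup."""
    tokens_a = page_a.split()
    tokens_b = page_b.split()
    matcher = difflib.SequenceMatcher(a=tokens_a, b=tokens_b)
    fields = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":  # mismatching runs are data, not template
            fields.append((" ".join(tokens_a[i1:i2]),
                           " ".join(tokens_b[j1:j2])))
    return fields

# Two hypothetical pages from one template:
a = "<html><b>Title:</b> DatabaseSystems <b>Price:</b> 40 </html>"
b = "<html><b>Title:</b> LogicProgramming <b>Price:</b> 27 </html>"
print(infer_fields(a, b))
# → [('DatabaseSystems', 'LogicProgramming'), ('40', '27')]
```

A real system must of course handle optional and repeated fields (lists of records), which is where the grammar-inference machinery of the full approach comes in; this sketch only shows the template/data separation on flat pages.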

Cited by 146 publications (127 citation statements)
References 29 publications
“…In fact, wrappers of various sorts have been around as long as the web itself, and continue today, especially in the context of the deep web [AK97, Ku98, CM04]. This is partly an essential bootstrapping exercise: unless semantic content is sufficiently universal, then users will not rely on it, and if users do not expect it providers will not supply it; external meta-data and inference at the time of use can effectively transform the human web to semantic form and break the impasse.…”
Section: Meta-information On Human Web Sources
confidence: 99%
“…We can also mention WebL [19], RoadRunner [8], JEDI [18], the Garlic project (http://www.almaden.ibm.com/cs/garlic/adagency.html), NoDoSE [1], the University of Maryland Wrapper Generation Project [11], TSIMMIS [12] or LAPIS [21].…”
Section: Tools
confidence: 99%
“…Here we applied screen scraping [12,21] techniques to fetch produce code, popular name, scientific name and description from the Brazilian Ministry of Agriculture Web portal. See Section 4 for details on these techniques.…”
Section: Data Acquisition
confidence: 99%