With the success of Web 2.0, we are witnessing a growing number of services and APIs exposed by telecom, IT, and content providers. Targeting the Web community and, in particular, Web application developers, service providers expose capabilities of their infrastructures and applications in order to open new markets and reach new customer groups. However, due to the complexity of the underlying technologies, the last step, i.e., the consumption and integration of the offered services, is a non-trivial and time-consuming task that remains the prerogative of expert developers. Although many approaches to lowering the entry barriers for end users exist, little success has been achieved so far. In this paper, we introduce the OMELETTE project and show how it addresses end-user-oriented telco mashup development. We present the goals of the project, describe its contributions, summarize current results, and outline ongoing and future work.
One of the challenges facing the current Web is the efficient use of all the available information. The Web 2.0 phenomenon has favored the creation of content by average users, and thus the amount of information available on diverse topics has grown exponentially in recent years. Initiatives such as Linked Data are helping to build the Semantic Web, in which a set of standards is proposed for the exchange of data among heterogeneous systems. However, these standards are not always used, and there are still plenty of websites whose contents and services can only be discovered through naive techniques. This paper proposes an integrated framework for content and service discovery and extraction. The framework is divided into several layers in which the discovery of contents and services is performed in a representational state transfer (REST) system such as the Web. It employs several web mining techniques as well as feature-oriented modeling for the discovery of cross-cutting features in web resources. The framework is demonstrated in a scenario involving electronic newspapers: an intelligent agent crawls the web for related news, invoking services and following links automatically according to its goal. This scenario illustrates how discovery takes place at different levels and how the use of semantics helps implement an agent that performs high-level tasks.
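To make the newspaper scenario concrete, the following Python sketch shows one naive way such a goal-driven agent might crawl for related news, following only links whose anchor text matches its goal. The seed URL, keyword set, and relevance heuristic are illustrative assumptions, not the framework's actual implementation.

```python
# A minimal, hypothetical sketch of a goal-driven crawler in the spirit of the
# newspaper scenario. Goal keywords and the scoring heuristic are assumptions.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

GOAL_KEYWORDS = {"economy", "inflation", "markets"}  # assumed agent goal


def relevance(text: str) -> int:
    """Naive lexical relevance: count occurrences of goal keywords."""
    words = text.lower().split()
    return sum(words.count(k) for k in GOAL_KEYWORDS)


def crawl(seed_url: str, max_pages: int = 20):
    """Breadth-first crawl that follows only links judged relevant to the goal."""
    frontier, seen, hits = [seed_url], set(), []
    while frontier and len(seen) < max_pages:
        url = frontier.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # unreachable page: skip it
        soup = BeautifulSoup(html, "html.parser")
        if relevance(soup.get_text(" ", strip=True)) > 0:
            hits.append(url)  # page matches the agent's goal
        # Visit a link only when its anchor text suggests goal relevance.
        for a in soup.find_all("a", href=True):
            if relevance(a.get_text()) > 0:
                frontier.append(urljoin(url, a["href"]))
    return hits


if __name__ == "__main__":
    for page in crawl("https://example.com/news"):
        print(page)
```

A full implementation would replace the lexical heuristic with the semantic matching the abstract describes, but the control flow, visiting links and services according to a goal, is the same.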
Information extraction from web pages, commonly known as screen scraping, is usually performed through wrapper induction, a technique based on the internal structure of HTML documents. The main limitation of such techniques is that a generated wrapper is only useful for the web page it was designed for. To overcome this, this paper proposes a system that generates first-order logic rules for extracting data from web pages. These rules are based on visual features such as font size, element positioning, or content type. Thus, they do not depend on a document's internal structure and are able to work across different sites. The system has been validated on a set of different web pages, showing very high precision and good recall, which confirms the robustness and generalization capabilities of the approach.
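As an illustration, the Python sketch below encodes extraction rules as predicates over visual features, in the spirit of the first-order rules described above. The element representation and the specific rules are hypothetical: a real system would read font sizes and positions from a rendering engine and would induce the rules rather than hand-code them.

```python
# A minimal sketch of rule-based extraction over visual features. Element
# records and rules are illustrative assumptions, not the paper's system.
from dataclasses import dataclass


@dataclass
class Element:
    text: str
    font_size: float  # rendered font size in px
    x: float          # left offset of the element's box
    y: float          # top offset of the element's box


# Each rule pairs a label with a predicate over visual features, mimicking a
# first-order clause such as: title(E) :- font_size(E) > 24, y(E) < 300.
RULES = [
    ("title", lambda e: e.font_size > 24 and e.y < 300),
    ("price", lambda e: e.text.strip().startswith("$") and e.font_size > 14),
    ("body",  lambda e: 10 <= e.font_size <= 14 and len(e.text) > 100),
]


def extract(elements):
    """Label each rendered element with the first matching visual rule."""
    records = {}
    for e in elements:
        for label, predicate in RULES:
            if predicate(e):
                records.setdefault(label, []).append(e.text)
                break  # one label per element
    return records


# Example: elements as a rendering engine might report them.
page = [
    Element("Acme Phone X", font_size=28, x=40, y=120),
    Element("$499.00", font_size=18, x=40, y=420),
    Element("The Acme Phone X ships with a 6.1-inch display and ..." * 3,
            font_size=12, x=40, y=480),
]
print(extract(page))  # {'title': [...], 'price': [...], 'body': [...]}
```

Because the predicates reference only visual features rather than HTML tags or paths, the same rule set can in principle be applied to pages from different sites, which is the portability the abstract claims.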