The Challenge of using Map-reduce to Query Open Data

2021

Information

Self Cite

How can analysts exploit the wide variety of JSON data sets currently available on the Internet, for example on Open Data portals? The traditional approach requires downloading them from the portals, storing them in a JSON document store, and integrating them within that store. However, once the data are integrated, the lack of a query language with flexible querying capabilities could prevent analysts from completing their analysis. In this paper, we show how the J-CO Framework, a novel framework that we developed at the University of Bergamo (Italy) to manage large collections of JSON documents, is a unique and innovative tool that provides analysts with querying capabilities based on fuzzy sets over JSON data sets. Its query language, called J-CO-QL, is continuously evolving to broaden its potential applications; the most recent extensions let analysts retrieve data sets directly from web portals, apply fuzzy set theory to JSON documents, and perform imprecise queries on documents by means of flexible soft conditions. This paper presents a practical case study in which real data sets are retrieved, integrated and analyzed to show the unique and innovative capabilities of the J-CO Framework.

Section: Genesis of the J-CO Framework

mentioning

confidence: 99%

Towards Flexible Retrieval, Integration and Analysis of JSON Data Sets through Fuzzy Sets: A Case Study

Fosci

2021

Information

Self Cite

“…We started our research on blind querying of Open Data corpora with [1]; the technique has been subsequently enhanced in [2]; this latter paper is the basis for the implementation of the Hammer prototype discussed in this paper. We previously discussed how to use Map-Reduce in the various components of the Hammer prototype in [3]; however, a deep analysis of performance was not performed. After these works, we think we are still the pioneers on this research line: it seems that no strictly related works are in literature.…”

Section: Related Work

mentioning

confidence: 99%

“…In this step, instances of selected data sets are downloaded from the Open Data portal and filtered. As shown in [3], the Map-Reduce approach helps with parallelizing this task.…”

Section: Other Executors Implemented by Means of Map-Reduce

mentioning

confidence: 99%

“…As an example, consider the generation of the inverted index. Given a data set ds and the set of terms appearing in its schema and meta-data, the Map phase builds pairs (term, ds_id); the Reduce phase aggregates all pairs into the inverted index, in such a way a term is associated with the list of identifiers of data sets that contain that term (in fact, we implemented the indexing phase by means of Map-Reduce as well [3], but its description is outside the scope of the paper).…”

mentioning

confidence: 99%
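
The inverted-index generation described in the citation statement above can be illustrated with a minimal, single-process sketch. It is not the Hammer implementation, which runs on Hadoop or Spark; the function names `map_phase` and `reduce_phase` and the toy `corpus` are hypothetical and introduced only for illustration. The Map step emits (term, ds_id) pairs, and the Reduce step groups them so that each term points to the identifiers of the data sets containing it.

```python
from collections import defaultdict
from itertools import chain


def map_phase(ds_id, terms):
    """Map step: emit one (term, ds_id) pair per distinct term of a data set."""
    return [(term, ds_id) for term in sorted(set(terms))]


def reduce_phase(pairs):
    """Reduce step: group pairs by term into term -> sorted list of data set ids."""
    index = defaultdict(set)
    for term, ds_id in pairs:
        index[term].add(ds_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical corpus: data set id -> terms found in its schema and meta-data.
corpus = {
    "ds1": ["city", "population", "year"],
    "ds2": ["city", "school"],
    "ds3": ["year", "school", "city"],
}

# Run the Map step on every data set, then feed all pairs to the Reduce step.
pairs = chain.from_iterable(
    map_phase(ds_id, terms) for ds_id, terms in corpus.items()
)
inverted_index = reduce_phase(pairs)

for term in sorted(inverted_index):
    print(term, inverted_index[term])
```

In a real Map-Reduce framework the grouping by key is performed by the shuffle stage between the two phases; here a dictionary stands in for it.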


Hadoop vs. Spark: Impact on Performance of the Hammer Query Engine for Open Data Corpora

Pelucchi

Toccu

2018

Algorithms

Self Cite

The Hammer prototype is a query engine for corpora of Open Data that provides users with the concept of blind querying. Since data sets published on Open Data portals are heterogeneous, users wishing to find interesting data sets are blind: queries cannot be fully specified, as in the case of databases. Consequently, the query engine is responsible for rewriting and adapting the blind query to the actual data sets, by exploiting lexical and semantic similarity. The effectiveness of this approach was discussed in our previous works. In this paper, we report our experience in developing the query engine. In the very first version of the prototype, we realized that the implementation of the retrieval technique was too slow, even though corpora contained only a few thousand data sets. We decided to adopt the Map-Reduce paradigm in order to parallelize the query engine and improve performance. We went through several versions of the query engine, based either on the Hadoop framework or on the Spark framework. Hadoop and Spark are two very popular frameworks for writing and executing parallel algorithms based on the Map-Reduce paradigm. In this paper, we present our study of the impact of adopting the Map-Reduce approach and its two most famous frameworks to parallelize the Hammer query engine; we discuss various implementations of the query engine, obtained either without significantly rewriting the algorithm or by completely rewriting it to exploit the high-level abstractions provided by Spark. The experimental campaign we performed shows the benefits provided by each studied solution, with the perspective of moving toward Big Data in the future. The lessons we learned are collected and synthesized into behavioral guidelines for developers approaching the problem of parallelizing algorithms by means of Map-Reduce frameworks.

“…The core of the Query Engine in the former Hammer framework was implemented by using the Map-Reduce paradigm [21]. The Query Engine of the new HammerJDB framework has inherited the query-engine core, thus it is a Map-Reduce algorithm too.…”

mentioning

confidence: 99%

Blind Queries Applied to JSON Document Stores

Marrara

Pelucchi

2019

Information

Self Cite

Social Media, Web Portals and, in general, information systems offer their own Application Programming Interfaces (APIs), used to provide large data sets concerning every aspect of everyday life. APIs usually provide data sets as collections of JSON documents. The heterogeneous structure of JSON documents returned by different APIs constitutes a barrier to effectively querying and analyzing these data sets. The adoption of NoSQL document stores, such as MongoDB, is useful for gathering these data sets, but does not solve the problem of querying the final heterogeneous repository. The aim of this paper is to provide analysts with a tool, named HammerJDB, that allows for blind querying of collections of JSON documents within a NoSQL document database. The underlying idea is that users may know the application domain without being aware of the real structures of the documents stored in the database; the tool for blind querying tries to bridge the gap by adopting a query rewriting mechanism. This paper is an evolution of a technique for blind querying Open Data portals and of its implementation within the Hammer framework, presented in previous work. In this paper, we extend that approach to query a NoSQL document database by evolving the Hammer framework into the HammerJDB framework, which is able to work on MongoDB databases. The effectiveness of the new approach is evaluated on a data set (derived from a real-life one) containing job-vacancy ads collected from European job portals.