Evaluating the Benefits of Key-Value Databases for Scientific Applications

Santamaria, Pol; Oden, Lena; Gil, Eloy; Becerra, Yolanda; Sirvent, Raül; Glock, Philipp; Torres, Jordi

doi:10.1007/978-3-030-22734-0_30

Cited by 2 publications

(2 citation statements)

References 14 publications

“…NoSQL databases are data storage systems which support high availability and horizontal scalability, at the expense of lower consistency guarantees than standard SQL databases [7]. Apache Cassandra [17] is a distributed, decentralized and highly scalable NoSQL DB, it is a free and open-source project and is widely adopted both in industry (e.g., Netflix, Uber [6]) and big data analytics contexts [25] (e.g., in the CERN ATLAS project [26]). As for the performance, Apache Cassandra offers low-latency (typically less than a millisecond), high-bandwidth, concurrent accesses to the stored data, while supporting easy scalability, high availability and tunable data redundancy.…”

Section: Apache Cassandra (mentioning)

confidence: 99%

Scaling deep learning data management with Cassandra DB

Versaci

Busonera

2021

2021 IEEE International Conference on Big Data (Big Data)

Deep learning (DL) algorithms require, to be fully effective, harvesting an increasingly large amount of data. These data, typically organized as millions of small files, stress filesystems and are difficult to manage. In fact, despite the huge development of DL tools and specialized hardware, the data loading pipeline for DL still lags behind in ease of use, standardization and scalability. In this work we try to rethink the data loading pipeline, by leveraging NoSQL DBs for storing both data and metadata, making them efficiently available through the network, and allowing easier data distribution for parallel DL training. We present our open-source, Apache Cassandra-based data loader and illustrate its use and performance, which enable easy and efficient data management and decentralized data distribution for parallel learning applications.

“…The first is handling the asynchronous communication with a high enough level of parallelism that can exploit the distributed database and thus achieve excellent performance. To this end, we used the C version of Hecuba [15] an HPC oriented library that we develop in our research group. Hecuba allows efficient use of NoSQL databases in MPI oriented applications by taking care of all the callback and asynchronous management of messages.…”

Section: A. HPC Integration (mentioning)

confidence: 99%

The OTree: Multidimensional Indexing with efficient data Sampling for HPC

Cugnasco

Calmet

Santamaria

et al. 2019

2019 IEEE International Conference on Big Data (Big Data)

Self Cite

Spatial big data is considered an essential trend in future scientific and business applications. Indeed, research instruments, medical devices, and social networks generate hundreds of petabytes of spatial data per year. However, many authors have pointed out that the lack of specialized frameworks for multidimensional Big Data is limiting possible applications and precluding many scientific breakthroughs. Paramount in achieving High-Performance Data Analytics is to optimize and reduce the I/O operations required to analyze large data sets. To do so, we need to organize and index the data according to its multidimensional attributes. At the same time, to enable fast and interactive exploratory analysis, it is vital to generate approximate representations of large datasets efficiently. In this paper, we propose the Outlook Tree (or OTree), a novel Multidimensional Indexing with efficient data Sampling (MIS) algorithm. The OTree enables exploratory analysis of large multidimensional datasets with arbitrary precision, a vital missing feature in current distributed data management solutions. Our algorithm reduces the indexing overhead and achieves high performance even for write-intensive HPC applications. Indeed, we use the OTree to store the scientific results of a study on the efficiency of drug inhalers. Then we compare the OTree implementation on Apache Cassandra, named Qbeast, with PostgreSQL and plain storage. Lastly, we demonstrate that our proposal delivers better performance and scalability.
