Data lake concept and systems: a survey

Hai, Rihan; Koutras, Christos; Quix, Christoph; Jarke, Matthias

doi:10.48550/arxiv.2106.09592

Cited by 7 publications

(11 citation statements)

References 75 publications


“…Finally, Hai et al [54] propose a new definition: “A data lake is a flexible and scalable data storage and management system that ingests and stores raw data from heterogeneous sources in its original format, and provides query processing and data analysis on the fly.” A data lake is not considered only as a storage system and must support on-demand data processing and querying. Furthermore, data indexing should only be performed if necessary at the time of data access, ingestion of data sources could be light, as there is no need to force schema definitions and mappings beforehand.…”

Section: Data Lake Architecture Review and Requirements (mentioning)

confidence: 99%

“…Hai et al [54] identify three examples of real applications of data lakes. Our project is of the type “multiple entries of heterogeneous raw data” with additions on: Managing geographic coordinates; Having a full JSON architecture for data querying; Considering a data valid for the infrastructure as soon as it has a “what, where, when” information; Combining relational and not only SQL (NoSQL) storage and querying systems; Giving the access to the data to any kind of users, including open data users; Furthermore, finally managing metadata with a catalogue managing spatially referenced resources.…”

Section: Data Lake Architecture Review and Requirements (mentioning)

confidence: 99%


CEBA: A Data Lake for Data Sharing and Environmental Monitoring

Sarramia, Claude, Ogereau, et al. 2022. Sensors.


This article presents a platform for environmental data named “Environmental Cloud for the Benefit of Agriculture” (CEBA). The CEBA should fill the gap of a regional institutional platform to share, search, store and visualize heterogeneous scientific data related to environmental and agricultural research. One of the main features of this tool is its ease of use and the accessibility of all types of data. To answer the question of data description, a scientific consensus has been established around the qualification of data with at least the information “when” (time), “where” (geographical coordinates) and “what” (metadata). The development of an on-premise solution using the data lake concept to provide a cloud service for end-users with institutional authentication and for open data access has been completed. Compared to other platforms, CEBA fully supports the management of geographic coordinates at every stage of data management. A comprehensive JavaScript Object Notation (JSON) architecture has been designed, among other things, to facilitate multi-stage data enrichment. Data from the wireless network are queried and accessed in near real-time, using a distributed JSON-based search engine.



“…As of today, a lot of development and analysis was conducted in the area of data lake architectures, where the so-called zone architecture (Patel et al, 2017; Ravat and Zhao, 2019), including the pond architecture (Inmon, 2016), became the most cited and used. These architectures have already been surveyed by Hai et al (2021) and Sawadogo and Darmont (2021), and a functional architecture was proposed by both of them, and a maturity and a hybrid architecture have been derived by Sawadogo and Darmont (2021). These surveys, however, did not include recent works like the definition of a zone reference model (Giebler et al, 2020) or a data lake architecture based on FAIR Digital Objects (FDOs) (Nolte and Wieder, 2022).…”

Section: Data Lake Architectures (mentioning)

confidence: 99%

“…Within a functional-based architecture classification, the data lake is analyzed toward its operations which are performed on the data while moving through the general data lake workflow. Hai et al (2021) define three layers, ingestion, maintenance, and exploration, where corresponding functions are then sub-grouped. A similar definition is provided by Sawadogo and Darmont (2021), where the four main components of a data lake are defined as ingestion, storage, processing, querying.…”

Section: Data Lake Architectures (mentioning)

confidence: 99%

Toward data lakes as central building blocks for data management and analysis

Wieder, Nolte 2022. Front. Big Data.


Data lakes are a fundamental building block for many industrial data analysis solutions and becoming increasingly popular in research. Often associated with big data use cases, data lakes are, for example, used as central data management systems of research institutions or as the core entity of machine learning pipelines. The basic underlying idea of retaining data in its native format within a data lake facilitates a large range of use cases and improves data reusability, especially when compared to the schema-on-write approach applied in data warehouses, where data is transformed prior to the actual storage to fit a predefined schema. Storing such massive amounts of raw data, however, has its very own challenges, spanning from the general data modeling, and indexing for concise querying to the integration of suitable and scalable compute capabilities. In this contribution, influential papers of the last decade have been selected to provide a comprehensive overview of developments and obtained results. The papers are analyzed with regard to the applicability of their input to data lakes that serve as central data management systems of research institutions. To achieve this, contributions to data lake architectures, metadata models, data provenance, workflow support, and FAIR principles are investigated. Last, but not least, these capabilities are mapped onto the requirements of two common research personae to identify open challenges. With that, potential research topics are determined, which have to be tackled toward the applicability of data lakes as central building blocks for research data management.


“…There is a strong emphasis on data science and analysis. Data science is frequently performed on vast repositories, sometimes referred to as data lakes, that contain a large number of distinct datasets [8]. The datasets may be sparsely or completely schema-less [9].…”

Section: Introduction (mentioning)

confidence: 99%

Processing Analytical Queries over Polystore System for a Large Astronomy Data Repository

et al. 2022


There are extremely large heterogeneous databases in the astronomical data domain, which keep increasing in size. The data types vary from images of astronomical objects to unstructured texts, relations, and key-values. Many astronomical data repositories manage such kinds of data. The Zwicky Transient Facility (ZTF) is one such data repository with a large amount of data with different varieties. Handling different types of data in a single database may have performance and efficiency issues. In this study, we propose a web-based query system built around the Polystore database architecture, and attempt to provide a solution for the growing size of data in the astronomical domain. The proposed system will unify querying over multiple datasets directly, thereby eliminating the effort to translate complex queries and simplifying the work for the users in the astronomical domain. In this proposal, we study the models of data integration, analyze them, and incorporate them into a system to manage linked open data provided by the astronomical domain. The proposed system is scalable, and its model can be used for various other systems to efficiently manage heterogeneous data.

