2021
DOI: 10.48550/arxiv.2106.09592
Preprint

Data lake concept and systems: a survey

Abstract: Although big data has been discussed for some years, it still has many research challenges, especially the variety of data. It poses a huge difficulty to efficiently integrate, access, and query the large volume of diverse data in information silos with the traditional 'schema-on-write' approaches such as data warehouses. Data lakes have been proposed as a solution to this problem. They are repositories storing raw data in its original formats and providing a common access interface. This survey reviews the de…

Cited by 7 publications (11 citation statements)
References 75 publications
“…Finally, Hai et al [54] propose a new definition: “A data lake is a flexible and scalable data storage and management system that ingests and stores raw data from heterogeneous sources in its original format, and provides query processing and data analysis on the fly.” A data lake is not considered only as a storage system and must support on-demand data processing and querying. Furthermore, data indexing should only be performed if necessary at the time of data access, ingestion of data sources could be light, as there is no need to force schema definitions and mappings beforehand.…”
Section: Data Lake Architecture Review and Requirements (mentioning)
confidence: 99%
“…Hai et al [54] identify three examples of real applications of data lakes. Our project is of the type “multiple entries of heterogeneous raw data” with additions on: Managing geographic coordinates; Having a full JSON architecture for data querying; Considering a data valid for the infrastructure as soon as it has a “what, where, when” information; Combining relational and not only SQL (NoSQL) storage and querying systems; Giving the access to the data to any kind of users, including open data users; Furthermore, finally managing metadata with a catalogue managing spatially referenced resources.…”
Section: Data Lake Architecture Review and Requirements (mentioning)
confidence: 99%
“…As of today, a lot of development and analysis was conducted in the area of data lake architectures, where the so-called zone architecture (Patel et al, 2017; Ravat and Zhao, 2019), including the pond architecture (Inmon, 2016), became the most cited and used. These architectures have already been surveyed by Hai et al (2021) and Sawadogo and Darmont (2021), and a functional architecture was proposed by both of them, and a maturity and a hybrid architecture have been derived by Sawadogo and Darmont (2021). These surveys, however, did not include recent works like the definition of a zone reference model (Giebler et al, 2020) or a data lake architecture based on FAIR Digital Objects (FDOs) (Nolte and Wieder, 2022).…”
Section: Data Lake Architectures (mentioning)
confidence: 99%
“…Within a functional-based architecture classification, the data lake is analyzed toward its operations which are performed on the data while moving through the general data lake workflow. Hai et al (2021) define three layers, ingestion, maintenance, and exploration, where corresponding functions are then sub-grouped. A similar definition is provided by Sawadogo and Darmont (2021), where the four main components of a data lake are defined as ingestion, storage, processing, querying.…”
Section: Data Lake Architectures (mentioning)
confidence: 99%
“…There is a strong emphasis on data science and analysis. Data science is frequently performed on vast repositories, sometimes referred to as data lakes, that contain a large number of distinct datasets [8]. The datasets may be sparsely or completely schema-less [9].…”
Section: Introduction (mentioning)
confidence: 99%