2020
DOI: 10.3389/fbioe.2020.553904

On the Logical Design of a Prototypical Data Lake System for Biological Resources

Abstract: Biological resources are multifarious, encompassing organisms, genetic materials, populations, or any other biotic components of ecosystems, and fine-grained data management and processing of these diverse types of resources poses a tremendous challenge for both researchers and practitioners. Before the conceptualization of data lakes, former big data management platforms in the research fields of computational biology and biomedicine could not deal with many practical data management tasks very well. As an …

Cited by 10 publications (4 citation statements)
References 41 publications
“…It first extracts essential information representative of the original raw data, referred to as features, e.g., keywords and named entities. Then it provides services that add synonyms and stems to such features, while it connects them to open knowledge bases such as Google Knowledge Graph 22, Wikidata 23. CoreDB also annotates and groups the data sources in the data lake.…”
Section: Semantic Metadata Enrichment
Citation type: mentioning
confidence: 99%
“…In essence, a data lake is a flexible, scalable data storage and management system, which ingests and stores raw data from heterogeneous sources in their original format, and provides maintenance, query processing and data analytics in an on-the-fly manner, with the help of rich metadata [116], [138], [142], [143]. Data lakes are proposed to store and manage data in many real-life use cases: Internet of things (IoT) and smart city [99], manufacturing [112], medicine [42], [55], [114], mobility service (e.g., Uber) [50], biology [23], smart grids [20], [103], air quality control [145], flights data [96], disease control, labor markets and products [13].…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…This method, originally proposed for the management of large transactional datasets (‘big data’), has become a generalized solution for management of heterogeneous data that offers benefits such as cost-effectiveness, high scalability, data fidelity, real-time data ingestion and fault tolerance 51. Mature examples of this method implement tiered access layers to ensure that sensitive participant data is protected 52. The use of a semi-structured data storage approach also facilitates the iterative development and application of rules-based and inference-based participant selection methods.…”
Section: Building Future Trial-ready Cohorts
Citation type: mentioning
confidence: 99%
“…Even when standards are adopted, the standardized structured metadata is often unexposed and not reusable. The proliferation and fragmentation of incomplete data repositories, lack of organization of data in endless Data Lakes 32 or in repositories with insufficient metadata, and the lack of common metadata standards make it difficult to combine separate data resources into a single searchable index. While standardizing metadata will not be sufficient to fully combine research data and code from different sources and enable meta-analyses, it is nevertheless a crucial first step towards this goal.…”
Section: Introduction
Citation type: mentioning
confidence: 99%