2019 IEEE/ACM Workflows in Support of Large-Scale Science (WORKS) 2019
DOI: 10.1109/works49585.2019.00006

Provenance Data in the Machine Learning Lifecycle in Computational Science and Engineering

Abstract: Machine Learning (ML) has become essential in several industries. In Computational Science and Engineering (CSE), the complexity of the ML lifecycle comes from the large variety of data, scientists' expertise, tools, and workflows. If data are not tracked properly during the lifecycle, it becomes infeasible to recreate an ML model from scratch or to explain to stakeholders how it was created. The main limitation of provenance tracking solutions is that they cannot cope with provenance capture and integration of…

Citations: cited by 32 publications (41 citation statements)
References: 17 publications
“…Considering performance, we validated ProvLake in another ML scenario, also in O&G industry, evaluating the performance with 48 GPUs in parallel, and the data capture overhead was less than 1% [Souza et al 2019a].…”
Section: Discussion (mentioning)
confidence: 99%
“…While capturing the data being processed in the lifecycle, ProvLake logically integrates and ingests them into a provenance database, named ProvLake Data View (PLView), ready for analyses at runtime [Souza et al 2019b]. The tool captures provenance of the three phases of ML lifecycle: data curation, data preparation for learning, and learning [Souza et al 2019a]. It then gives an integrated view of domain data, execution data, and ML data in multiworkflows supporting queries and analysis on such data.…”
Section: Introduction (mentioning)
confidence: 99%
“…Data provenance (or data lineage) methods aim to improve replication, tracing, quality assessment in data use and data transformation processes [19]. Several researchers have proposed data provenance and lineage solutions for the tracking of data and data transformations during the machine learning lifecycle [43,44,51]. Further, Bertino et al [8] proposes the use of blockchain technology to encourage data transparency and ensure that data collection and utilization coincide with ethical principles.…”
Section: Data Transparency (mentioning)
confidence: 99%
“…Thus, we propose another aspect for building a context from the bottom up and separating it into a similar method of abstraction, while focusing on generating it as a knowledge base. For solving the issues or hurdles mentioned above, studies with Automated Machine Learning (AutoML) 34 and workflow components for machine learning (ML) applications 35,36 have been conducted. Particularly Auto-WEKA 37 or auto-sklearn 38 are the ones of AutoML concepts to automate analysis tasks.…”
Section: Related Work (mentioning)
confidence: 99%