2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS) 2019
DOI: 10.1109/icdcs.2019.00161

The Best of Both Worlds: Challenges in Linking Provenance and Explainability in Distributed Machine Learning

Abstract: Machine learning experts prefer to think of their input as a single, homogeneous, and consistent data set. However, when analyzing large volumes of data, the entire data set may not be manageable on a single server, but must be stored on a distributed file system instead. Moreover, with the pressing demand to deliver explainable models, the experts may no longer focus on the machine learning algorithms in isolation, but must take into account the distributed nature of the data stored, as well as the impact of …

Cited by 7 publications (5 citation statements)
References 44 publications
“…With the recent interest in DL methods, several works propose provenance management approaches for data analysis during DNN training [11]. There are several challenges in making ML workflows provenance aware like taking into account the execution framework that may involve CPUs, GPUs, TPUs, and distributed environments such as clusters and clouds as discussed in [36,42,14]. In this section, we discuss related work for provenance data management, considering the intention of using provenance for runtime data analysis.…”
Section: Related Work (mentioning)
confidence: 99%
“…The approaches in this category manage provenance for several purposes in ML platforms [46,35,24,40,2,36,30,41,12]. They are all based on a proprietary representation of provenance data, i.e., that does not follow recommendations like W3C PROV.…”
Section: Machine- and Deep Learning-specific Approaches (mentioning)
confidence: 99%
“…Specifically, a number of tools are available to help developers build machine learning pipelines [50,18,51] or debug them [52], but these lack the ability to explain the provenance of a certain data item in the processed dataset. Others link provenance to explainability in a distributed machine learning setting [53] but without offering specific tools. Amazon identifies that there are common and reusable components to a machine learning pipeline, but that there is no way to track the exploration of pipeline construction effectively, and calls for metadata capture to support reasoning over pipeline design [54].…”
Section: Related Work (mentioning)
confidence: 99%
“…To explain how a DML algorithm gives a decision, all transformations applied to the data should be considered. It is claimed in [214] that even basic transformations in data pre-processing, such as data partitioning, local data cleaning and value imputation, can have a strong impact on the resultant model. The effect becomes more apparent under a distributed setting.…”
Section: B Open Issues (mentioning)
confidence: 99%