Understanding Data Science Lifecycle Provenance via Graph Segmentation and Summarization

Miao, Hui; Deshpande, Amol

doi:10.1109/icde.2019.00179

Cited by 9 publications

(5 citation statements)

References 54 publications

“…Data science helps to process and supports the derivation of valuable data [59]. Figure 8 illustrates the data science lifecycle [12, 60]. The lifecycle is based on the seven iterative steps such as data cleaning, data exploration, and data mining, etc.…”

Section: Next-generation Advancements in the Internet of Things (IoT)... (mentioning)

confidence: 99%

A Step toward Next-Generation Advancements in the Internet of Things Technologies

Amin, Abbasi, Mateen et al. 2022

Sensors

The Internet of Things (IoT) devices generate a large amount of data over networks; therefore, the efficiency, complexity, interfaces, dynamics, robustness, and interaction need to be re-examined on a large scale. This phenomenon will lead to seamless network connectivity and the capability to provide support for the IoT. The traditional IoT is not enough to provide support. Therefore, we designed this study to provide a systematic analysis of next-generation advancements in the IoT. We propose a systematic catalog that covers the most recent advances in the traditional IoT. An overview of the IoT from the perspectives of big data, data science, and network science disciplines and also connecting technologies is given. We highlight the conceptual view of the IoT, key concepts, growth, and most recent trends. We discuss and highlight the importance and the integration of big data, data science, and network science along with key applications such as artificial intelligence, machine learning, blockchain, federated learning, etc. Finally, we discuss various challenges and issues of IoT such as architecture, integration, data provenance, and important applications such as cloud and edge computing, etc. This article will provide aid to the readers and other researchers in an understanding of the IoT’s next-generation developments and tell how they apply to the real world.

“…The issues of inter-connectedness and size of provenance graphs have similarly emerged in different domains, wherein techniques such as user views, segmentation, and aggregation have been explored to transform the graphs to usable or interpretable ones [9,17,37,38]. We adopt a similar approach but we leverage the semantics of production-ML operators and connections between them.…”

Section: Model Graphlets (mentioning)

confidence: 99%

“…Previous work has even led to the standardization of provenance representations for workflows in the form of graphs [26,39,40]. Other research has proposed various ways to explore and analyze such provenance graphs, e.g., visualization [14], reachability query support [13], support for user-defined views [17], segmentation and summarization [9,10,37]. Our work introduces a framework to segment ML provenance graphs and demonstrates how this segmentation leads to further analysis and optimizations for ML pipelines.…”

Section: Introduction (mentioning)

confidence: 99%

Production Machine Learning Pipelines: Empirical Analysis and Optimization Opportunities

Xin, Miao, Parameswaran et al. 2021

Preprint

Self Cite

Machine learning (ML) is now commonplace, powering data-driven applications in various organizations. Unlike the traditional perception of ML in research, ML production pipelines are complex, with many interlocking analytical components beyond training, whose sub-parts are often run multiple times on overlapping subsets of data. However, there is a lack of quantitative evidence regarding the lifespan, architecture, frequency, and complexity of these pipelines to understand how data management research can be used to make them more efficient, effective, robust, and reproducible. To that end, we analyze the provenance graphs of 3000 production ML pipelines at Google, comprising over 450,000 models trained, spanning a period of over four months, in an effort to understand the complexity and challenges underlying production ML. Our analysis reveals the characteristics, components, and topologies of typical industry-strength ML pipelines at various granularities. Along the way, we introduce a specialized data model for representing and reasoning about repeatedly run components in these ML pipelines, which we call model graphlets. We identify several rich opportunities for optimization, leveraging traditional data management ideas. We show how targeting even one of these opportunities, i.e., identifying and pruning wasted computation that does not translate to model deployment, can reduce wasted computation cost by 50% without compromising the model deployment cadence.

“…Furthermore, given information about the place data originated from, how they come in their present states, and who or what acted on them helps users to establish trust in the data. Provenance can show resources and relations that have affected the construction of the output data and are commonly expressed as directed graphs (digraphs) [17]. The primary aim of the W3C standardized provenance is to enable the extensive publication and exchange of provenance over the web [18].…”

Section: Conceptual View of Data Provenance (mentioning)

confidence: 99%

An Interactive and Predictive Pre-diagnostic Model for Healthcare based on Data Provenance

Ahmed, Hussien 2019

UHD J SCI TECH

The future of healthcare may look completely different from the current clinic-center services. Rapidly growing and developing technologies are expected to change clinics throughout the world. However, the healthcare delivered to impaired patients, such as elderly and disabled people, possibly still requires hands-on human expertise. The aim of this study is to propose a predictive model that pre-diagnoses illnesses by analyzing symptoms that are interactively taken from patients via several hand gestures during a period of time. This is particularly helpful in assisting clinicians and doctors to gain a better understanding and make more accurate decisions about future plans for their patients’ situations. The hand gestures are detected, the time of each gesture is recorded, and the gestures are then associated with their designated symptoms. This information is captured in the form of provenance graphs constructed based on the W3C PROV data model. The provenance graph is analyzed by extracting several network metrics, and then supervised machine-learning algorithms are used to build a predictive model. The model is used to predict diseases from the symptoms with a maximum accuracy of 84.5%.
