Introduction
Data provenance is information about the entities, activities, and people that have effected some type of transformation on a data product through the product's lifecycle. Data provenance captured from scientific applications is a critical precursor to data sharing and reuse. For researchers wanting to repurpose and reuse data, it is a source of information about the lineage and attribution of the data, which is needed to establish trust in a data set. Data provenance has been shown to be useful in results validation, failure tracing, and reproducibility. The Komadu provenance capture system is standalone, meaning it is not coupled to or dependent upon any database management system, repository, or scientific workflow system. It provides an ingest API through which provenance notifications are fed into the system at high speeds, and a query API through which provenance information can be queried. The data model is both event oriented and graph oriented, in that graphs are pieced together in Komadu based on the events received from the environment.
Komadu has its roots in the Karma [2] provenance capture system, an earlier version that complied with the OPM community standard [3] both for defining the type of provenance notifications that the system accepted and for defining the format of the results. Komadu, on the other hand, supports the W3C PROV specification [1], which provides far richer types of relationships and has a more formal model for handling time than does OPM. Karma was additionally limited by assuming that every notification belonging to the same external activity carried a common global identifier shared across all components (services, methods, etc.) of the external environment. This limitation was found to be severe in applications where provenance is captured not only at the application level but also in the larger environment where the application runs. Take for instance a distributed application running in PlanetLab [7] and running under Twister [8]; it is highly limiting to expect provenance events generated from the application, from PlanetLab, and from Twister to all have shared knowledge of any single global identifier. This limitation derives from Karma's early days, when it tracked provenance for applications running within a single workflow system. Additionally, a researcher may be interested in tracking lineage starting from some data product or agent. Such scenarios are not supported by Karma.
In this paper, we introduce the Komadu [9] provenance capture and visualization system. Komadu is a complete redesign and reimplementation of Karma that supports new features while addressing the above-mentioned limitations of Karma. The main contributions of Komadu are as follows. Even though Komadu has been used most extensively in relation to scientific research, its interfaces are designed to collect and visualize provenance for any kind of application needing provenance.
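The event-oriented, graph-oriented model described in the introduction can be illustrated with a minimal Python sketch. It is not Komadu's ingest or query API: the `ProvenanceStore` class, its `ingest` and `lineage` methods, and the node identifiers are all hypothetical names chosen for illustration. Only the relation names (`used`, `wasGeneratedBy`, `wasAssociatedWith`) come from the W3C PROV vocabulary. The sketch assumes each notification names the two nodes it relates, so edges can be stitched together on node identifiers alone, without any global identifier shared across components, and lineage can be traced starting from any data product, activity, or agent.

```python
from collections import defaultdict, deque


class ProvenanceStore:
    """Hypothetical in-memory store; an illustration, not Komadu's API."""

    def __init__(self):
        # edges[node] holds (relation, other_node) pairs. PROV relations
        # point from the later node to the earlier one, so following them
        # walks backward through the lineage.
        self.edges = defaultdict(list)

    def ingest(self, relation, subject, obj):
        """Record one notification, e.g. ('used', activity, entity)."""
        self.edges[subject].append((relation, obj))

    def lineage(self, start):
        """Breadth-first walk from any node; returns the edges reached."""
        seen = {start}
        queue = deque([start])
        result = []
        while queue:
            node = queue.popleft()
            for relation, other in self.edges.get(node, []):
                result.append((node, relation, other))
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        return result


# Notifications may arrive in any order and from different components.
store = ProvenanceStore()
store.ingest("used", "activity:align", "entity:raw.csv")
store.ingest("wasGeneratedBy", "entity:aligned.csv", "activity:align")
store.ingest("wasAssociatedWith", "activity:align", "agent:alice")

for edge in store.lineage("entity:aligned.csv"):
    print(edge)
```

Starting the walk from `entity:aligned.csv` reaches the generating activity, its input, and the associated agent, which mirrors the scenario of tracking lineage from a chosen data product rather than from a workflow-wide identifier.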
This dissertation is a result of an effort over many years. There are so many people who helped me in various ways during this endeavor. Without their generous support and encouragement, this work would not have been possible. First of all, I am so grateful to my Ph.D. advisor, Prof. Beth Plale for her invaluable support, guidance, and encouragement throughout my Ph.D. Her research experience over many years across multiple areas of Computer Science helped me in many ways to solve hard research problems and to successfully present them as publications. In addition to that, she was so kind to me and my family during our hard times. I am truly honored to have worked with her throughout my Ph.D. studies. I would like to thank my research committee members Prof. David Leake, Prof. Ryan Newton and Prof. Judy Qiu for their guidance and advice on my qualifying exams, thesis proposal, and final dissertation. I should thank all professors at the School of Informatics, Computing, and Engineering from whom I took a number of courses which helped immensely to improve my knowledge and skills.