Data streaming frameworks, built to run on large numbers of processing nodes in order to analyze big data, are fault-prone. The many nodes and network components that can fail are not the only source of errors: when developing data-analysis jobs, errors or wrong assumptions about the input data may only be detected during production processing. This usually forces a re-execution of the entire job, re-computing all input data. That can be a tremendous waste of computing time if most of the job's tasks are unaffected by the changes and would therefore process and produce exactly the same data again.

This paper describes an approach that uses materialized intermediate data from previous job executions to reduce the number of tasks that must be re-executed when a job is updated. Saving intermediate data to disk is a common technique for achieving fault tolerance in data streaming systems. These intermediate results can be reused for memoization, avoiding needless re-execution of tasks. We show that memoization can markedly decrease the runtime of an updated job.
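The core idea of reusing materialized intermediate results can be illustrated with a minimal sketch. All names here (`run_task`, `_digest`, the in-memory `_cache` dictionary) are hypothetical and stand in for the framework's on-disk materialization: a task's output is keyed by its identity, its code version, and a digest of its input, so an updated job only re-executes tasks whose code or input actually changed.

```python
import hashlib
import pickle

# Illustrative in-memory cache; a real streaming framework would
# materialize these intermediate results to disk for fault tolerance.
_cache = {}

def _digest(obj):
    """Stable digest of a task's input (assumes picklable, deterministic data)."""
    return hashlib.sha256(pickle.dumps(obj)).hexdigest()

def run_task(task_id, task_version, task_fn, task_input):
    """Run a task, memoizing its result by (task id, code version, input digest)."""
    key = (task_id, task_version, _digest(task_input))
    if key in _cache:
        # Memoization hit: reuse the materialized intermediate result.
        return _cache[key]
    # Miss: only reached when the task's code or input changed.
    result = task_fn(task_input)
    _cache[key] = result
    return result
```

Under this scheme, re-submitting an updated job re-executes only the tasks downstream of the change; every unaffected task resolves to a cache hit, which is the source of the runtime savings discussed above.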