Advanced data flow support for scientific grid workflow applications

Qin, Jun; Fahringer, Thomas

doi:10.1145/1362622.1362679

Cited by 32 publications

(13 citation statements)

References 18 publications


“…Conventionally data is conceived as "a collection of facts from which conclusions may be drawn" [41] or "group(s) of information that represent the qualitative or quantitative attributes of a variable or set of variables" [40]. Such type of data is also called statistical data.…”

Section: Data (mentioning)

confidence: 99%

“…Data replication is commonly used to ensure high availability, reliability, fault tolerance, and efficient access of data. • Data Collection: The data collection refers to more than one sets of data [40]. A data management system in large scale distributed environments like Grid, should provide the functionality of data collection for efficient referencing of the large data sets.…”

Section: Data Management Tasks For High Performance Environments (mentioning)

confidence: 99%


A Taxonomy of Data Management Models in Distributed and Grid Environments

Nadeem, 2016

IJITCS


Abstract: Distributed environments vary widely in their architectures, from tightly coupled cluster environments to loosely coupled Grid environments and completely uncoupled peer-to-peer environments, and thus differ in their working environments as well as in performance. To meet the specific needs of these environments for data organization, replication, transfer, scheduling, etc., data management systems implement different data management models. In this paper, major data management tasks in distributed environments are identified and a taxonomy of the data management models in these environments is presented. The taxonomy is used to highlight the specific data management requirements of each environment and the strengths and weaknesses of the implemented data management models. The taxonomy is followed by a survey of different distributed and Grid environments and the data management models they implement. The taxonomy and the survey results are used to identify the issues and challenges of data management for future exploration.


“…For instance, in [33], McClatchey et al introduce a prototype scientific workflow management system entitled CRISTAL, and the distributed scientific workflow applications that they consider are SPGs. In [41], Qin and Fahringer discuss several scientific grid workflow applications, which are all structured as SPGs: the WIEN2k workflow performs electronic structure calculations of solids using density functional theory [7], the MeteoAG workflow is a meteorology simulation application [43], and the GRASIL workflow calculates the spectral energy distribution of galaxies [44]; this latter application has actually a fork-join graph. A last example is the fMRI workflow [52], which is a cognitive neuroscience application.…”

Section: Related Work (mentioning)

confidence: 99%

Energy-Aware Mappings of Series-Parallel Workflows onto Chip Multiprocessors

Benoît, Renaud-Goud, Robert, et al., 2011

2011 International Conference on Parallel Processing


This paper studies the problem of mapping streaming applications that can be modeled by a series-parallel graph onto a 2-dimensional tiled CMP architecture. The objective of the mapping is to minimize the energy consumption, using dynamic voltage scaling techniques, while maintaining a given level of performance, reflected by the rate of processing the data streams. This mapping problem turns out to be NP-hard, but we identify simpler instances, whose optimal solution can be computed by a dynamic programming algorithm in polynomial time. Several heuristics are proposed to tackle the general problem, building upon the theoretical results. Finally, we assess the performance of the heuristics through comprehensive simulations using the StreamIt workflow suite and various CMP grid sizes.

Key-words: series-parallel graph; DAG; mapping; multicore; CMP; energy; power; period; throughput; DVS; DVFS; complexity; simulation; streaming applications; optimization.

Summary (translated from the French résumé): In this research report, we study the mapping of streaming applications, represented as a series-parallel graph, onto a multicore processor, aiming to minimize the energy consumed while not exceeding a bound on a performance criterion, the period. The theoretical part establishes the NP-completeness or polynomial-time solvability of the problem, depending on structural properties of the multicore (chain of cores, uni- or bi-directional; grid of cores) and on the width of the application graph (bounded or not). Since the least constrained problem is NP-complete, the experimental part describes four heuristics, compares them with one another, and gives an integer linear program that yields the optimal solution in exponential time.


“…For instance, many systems (e.g., [27,32,36,31,23,39,29,30,18,26]) support actors that make only small changes or updates to incoming data, passing on some or all of their input to downstream actors. Thus, if invocation a above retains within its output y some unchanged substructure s from its input x, denoted as x = (s ⊕ x′), y = (s ⊕ y′), then s will be stored twice: once in the trace record in(x, a) (call this occurrence s_x) and once in out(a, y) (call this occurrence s_y).…”

Section: Introduction (mentioning)

confidence: 99%

“…Furthermore, because actors often wrap complex external applications and services, various patterns of data dependencies (e.g., see [31,32,36,7]) can arise in which not all parts of the output depend on all parts of the input. Assume, e.g., that invocation a above receives input x and produces output y as follows x = (x 1 ⊕ .…”

Section: Introduction (mentioning)

confidence: 99%

Efficient provenance storage over nested data collections

Anand, Bowers, McPhillips, et al., 2009

Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology


Scientific workflow systems are increasingly used to automate complex data analyses, largely due to their benefits over traditional approaches for workflow design, optimization, and provenance recording. Many workflow systems employ a simple dependency model to represent the provenance of data produced by workflow runs. Although commonly adopted, this model does not capture explicit data dependencies introduced by "provenance-aware" processes, and it can lead to inefficient storage when workflow data is complex or structured. We present a provenance model, extending the conventional approach, that supports (i) explicit data dependencies and (ii) nested data collections. Our model adopts techniques from reference-based XML versioning, adding annotations for process and data dependencies. We present strategies and reduction techniques to store immediate and transitive provenance information within our model, and examine trade-offs among update time, storage size, and query response time. We evaluate our approach on real-world and synthetic workflow execution traces, demonstrating significant reductions in storage size, while also reducing the time required to store and query provenance information.
