Data intensive science at synchrotron based 3D x-ray imaging facilities

De Carlo, Francesco; Xiao, Xianghui; Fezzaa, Kamel; Wang, Steve; Schwarz, Nina; Jacobsen, Chris; Chawla, Nikhilesh; Fußeis, Florian

doi:10.1109/escience.2012.6404468

Cited by 4 publications

(2 citation statements)

References 3 publications

“…Large scale applications have been shown to benefit significantly on heterogeneous systems [23] for data-intensive science [24] and under multiple sites infrastructure [25]. We demonstrate the value of these arguments in a realistic scenario.…”

Section: Related Work (mentioning)

confidence: 92%

Workflow performance improvement using model-based scheduling over multiple clusters and clouds

Maheshwari, Jung, Meng, et al. 2016

Future Generation Computer Systems

Please cite this article as: K. Maheshwari, E.-S. Jung, J. Meng, V. Morozov, V. Vishwanath, R. Kettimuthu, Workflow performance improvement using model-based scheduling over multiple clusters and clouds, Future Generation Computer Systems (2015), http://dx.

Abstract: In recent years, a variety of computational sites and resources have emerged, and users often have access to multiple resources that are distributed. These sites are heterogeneous in nature and performance of different tasks in a workflow varies from one site to another. Additionally, users typically have a limited resource allocation at each site capped by administrative policies. In such cases, judicious scheduling strategy is required in order to map tasks in the workflow to resources so that the workload is balanced among sites and the overhead is minimized in data transfer. Most existing systems either run the entire workflow in a single site or use naïve approaches to distribute the tasks across sites or leave it to the user to optimize the allocation of tasks to distributed resources. This results in a significant loss in productivity. We propose a multi-site workflow scheduling technique that uses performance models to predict the execution time on resources and dynamic probes to identify the achievable network throughput between sites. We evaluate our approach using real world applications using the Swift parallel and distributed execution framework. We use two distinct computational environments: geographically distributed multiple clusters and multiple clouds. We show that our approach improves the resource utilization and reduces execution time when compared to the default schedule.

“…The analytics domain covers a broad scope, including sampling and experimental design, robustness to data shortcomings (e.g., size, sampling, foregrounds, or noise), inverse problems for parameter estimation, approximate algorithms (e.g., balancing performance and error controls), end-to-end propagation of uncertainty, and collaborative visualization. [WDM+01] that can produce 150 terabytes of data in one day if its detectors run at maximum capacity (although this is much more than the current average daily volume) [DXF+12] and NSLS-II, which will produce an average of ~75 terabytes per day after only the first few years of operation (15 petabytes per year). The variety of data at light sources compounds the challenges posed by increasing data volumes and rates.…”

Section: The Cosmic Frontier (mentioning)

confidence: 99%

Data Crosscutting Requirements Review

Dam, Shoshani, Plata 2013

Model-driven multisite workflow scheduling

Maheshwari, Jung, Meng, et al. 2013

2013 IEEE International Conference on Cluster Computing (CLUSTER)

Abstract: Workflows continue to play an important role in expressing and deploying scientific applications. In recent years, a wide variety of computational sites have emerged with shared access to users. A user may not be able to complete a complex workflow at a single site. It is thus beneficial to run different tasks of a workflow on different sites. For such cases, judicious scheduling strategy is required in order to map tasks in the workflow to resources at multiple sites so that the workload is balanced among sites and the overhead is minimized in data transfer. The key challenge is that the data transfer rate among sites varies based on the network capacity and load. We propose a workflow scheduling technique that tackles the multi-site task distribution challenge by using data movement performance modeling. We applied this technique to schedule an earth observation science workflow over three sites. Executed via the Swift parallel scripting paradigm, we augmented its default schedule and improved the time-to-completion by up to 52%.
