In this paper, we address the challenge of analyzing simulation data on HPC systems using Apache Spark, a Big Data framework. One of the main problems we encountered when running Spark on HPC systems is an ephemeral data explosion, brought about by the curse of persistence in the Spark framework. Data persistence is essential for reducing I/O, but it comes at the cost of storage space: we show that in some cases, Spark scratch data can consume an order of magnitude more space than the input data being analyzed, leading to fatal out-of-disk errors. We investigate the real-world application of scaling machine learning algorithms to predict and analyze failures in multi-physics simulations on 76 TB of data (over one trillion training examples), a problem 2-3 orders of magnitude larger than prior work. Based on extensive experiments at scale, we provide several concrete state-of-the-practice recommendations and demonstrate a 7x reduction in disk utilization with negligible increases, or even decreases, in runtime.
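
To make the persistence trade-off concrete, the following minimal sketch shows how the choice of Spark storage level determines whether cached partitions spill to local scratch disk. The input path and application name are hypothetical placeholders, not artifacts from our experiments.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistenceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("persistence-sketch") // hypothetical application name
      .getOrCreate()

    // Hypothetical input path standing in for the simulation feature data.
    val features = spark.read.parquet("/path/to/simulation/features")

    // MEMORY_AND_DISK (the default for Dataset.persist) spills partitions that
    // do not fit in executor memory to local scratch disk; together with shuffle
    // files, this ephemeral data can grow far beyond the size of the input.
    features.persist(StorageLevel.MEMORY_AND_DISK)

    // Alternative: MEMORY_ONLY avoids scratch-disk usage for the cache itself,
    // at the cost of recomputing partitions evicted from memory.
    // features.persist(StorageLevel.MEMORY_ONLY)

    // Reuse the persisted data across multiple actions so the cache pays off.
    val total = features.count()
    println(s"training examples: $total")

    // Release cached blocks (including any on-disk copies) when done.
    features.unpersist()
    spark.stop()
  }
}
```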