Accelerating large-scale data exploration through data diffusion

Fei

2011 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery

et al. 2011

Self Cite

Abstract: Cloud computing is gaining tremendous momentum in both academia and industry. The application of Cloud computing, however, has mostly focused on Web applications and business applications, while the use of Cloud computing to support large-scale workflows, especially data-intensive scientific workflows, is still largely overlooked. We coin the term "Cloud Workflow" to refer to the specification, execution, and provenance tracking of large-scale scientific workflows, as well as the management of data and computing resources to enable the execution of scientific workflows on the Cloud. In this paper, we analyze why there has been such a gap between the two technologies, and what it means to bring Cloud and workflow together; we then present the key challenges in running Cloud workflows, and discuss the research opportunities in realizing workflows on the Cloud.

Section: E Data Management Challenge (mentioning)

confidence: 99%

Opportunities and Challenges in Running Scientific Workflows on the Cloud

Fei

2011 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery

et al. 2011

Self Cite

“…gigabit Ethernet) as well as proprietary and more exotic networks (Torus, Tree, and Infiniband). [9,16] We believe that there is more to HPC than tightly coupled MPI, and more to HTC than embarrassingly parallel long running jobs. Like HPC applications, and science itself, applications are becoming increasingly complex opening new doors for many opportunities to apply HPC in new ways if we broaden our perspective.…”

Section: Discussion (mentioning)

confidence: 99%

“…3) Keeping data size modest, but increasing the number of tasks moves us into the loosely coupled applications involving many tasks (yellow); Swift/Falkon [6,7] and Pegasus/DAGMan [8] are examples of this category. 4) Finally, the combination of both many tasks and large datasets moves us into the data-intensive many-task computing category (green); examples of this category are Swift/Falkon and data diffusion [9], Dryad [ Sawzall [11].…”

Section: Defining Many Task Computing (mentioning)

confidence: 99%

Many-task computing for grids and supercomputers

Foster

2008 Workshop on Many-Task Computing on Grids and Supercomputers

2008

Self Cite

“…To the best of our knowledge, HyCache is the first user-level POSIX-compliant hybrid caching for distributed file systems. Some of our previous work [15][16][17] proposed data caching to accelerate applications by modifying the applications and/or their workflow, rather than at the filesystem level. Other existing work requires modifying the OS kernel, or lacks a systematic caching mechanism for manipulating files across multiple storage devices, or does not support the POSIX interface.…”

Section: Application (mentioning)

confidence: 99%

HyCache: A User-Level Caching Middleware for Distributed File Systems

2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum

2013

Self Cite

Abstract: One of the bottlenecks of distributed file systems is the mechanical hard disk drive (HDD). Although solid-state drives (SSD) have been around since the 1990s, HDDs are still dominant due to their large capacity and relatively low cost. Hybrid hard drives with a small built-in SSD cache do not meet the needs of a large variety of workloads. This paper proposes a middleware that manages the underlying heterogeneous storage devices in order to allow distributed file systems to leverage the performance of SSDs while retaining the capacity of HDDs. We design and implement a user-level filesystem, HyCache, that can offer SSD-like performance at a cost similar to an HDD. We show how HyCache can be used to improve performance in distributed file systems, such as the Hadoop HDFS. Experiments show that HyCache achieves up to 7X higher throughput and 76X higher IOPS than the Linux Ext4 file system, and can accelerate HDFS by 28% at 32-node scales (compared to vanilla HDFS).

Index Terms: distributed file systems, user level file systems, hybrid file systems, heterogeneous storage, SSD