Although the I/O functions described in the MPI standard have included shared file pointer support from the beginning, the performance and portability of these functions have been subpar at best. ROMIO [1], which provides the MPI-IO functionality for most MPI libraries, to this day uses a separate file to manage the shared file pointer; this auxiliary file holds the current value of the pointer. Unfortunately, each access to the shared file pointer involves file lock management and updates to the file contents. Furthermore, support for shared file pointers is not universally available: few file systems support native shared file pointers [5], and some file systems do not support file locks [3]. Application developers therefore rarely use shared file pointers, even though many applications can benefit from this file I/O capability. Such applications are typically loosely coupled and rarely exhibit application-wide synchronization; examples include application tracing toolkits [8,4] and many-task computing applications [10]. Instead of shared file pointers, these application classes frequently rely on file-per-process, file-per-thread, or file-per-rank I/O patterns. While these approaches work relatively well at smaller scales, they fail to scale to leadership-class computing systems because of the intense metadata loads they generate. Recent research identified significant improvements from using shared-file I/O instead of multifile I/O patterns on leadership-class systems [6].

In this paper, we propose integrating shared file pointer support into the I/O forwarding layer commonly found on leadership-class computing systems. I/O forwarding middleware, such as the I/O Forwarding Scalability Layer (IOFSL) [9,2], bridges the compute and I/O subsystems of these systems. This middleware layer captures all file I/O requests generated by applications executing on compute nodes and forwards them to dedicated I/O nodes, a common hardware feature of leadership-class computing systems, which execute the I/O requests on behalf of the application. The I/O forwarding layer on these systems is best suited to provide and manage shared file pointers because it has access to all application I/O requests and can provide enhanced file I/O capabilities independent of the system and I/O software stack. By embedding this capability into the I/O forwarding layer, application developers can use shared file pointers across a variety of file I/O APIs (MPI-IO, POSIX, and ZOIDFS), synchronization levels (collective and independent I/O), and computing systems (IBM Blue Gene and Cray XT systems).

We are adding several features to IOFSL and ROMIO to enable portable MPI-IO shared file pointer access. In prior work, we extended the ZOIDFS API [2] to provide a distributed atomic append capability. Our current work extends and generalizes this capability to provide shared file pointers as defined by the MPI standard. First, we created a per-file shared (key,v...
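To make the cost of the lock-file emulation described above concrete, the following simplified sketch shows the technique: every shared-pointer access opens the auxiliary pointer file, acquires a whole-file lock, reads and advances the stored offset, and releases the lock. The function name and layout are illustrative only and do not correspond to ROMIO's internal code; error handling is omitted for brevity.

```c
/* Simplified sketch of emulating a shared file pointer with a lock file:
 * the pointer value lives in a separate file, and every access takes a
 * whole-file fcntl() lock, reads the current value, and writes back the
 * advanced value. Illustrative only; not ROMIO's implementation. */
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

/* Fetch the current shared offset from the pointer file and advance it by
 * nbytes; returns the offset at which the caller may write its data. */
static off_t shared_fp_fetch_and_add(const char *ptrfile, off_t nbytes)
{
    int fd = open(ptrfile, O_RDWR | O_CREAT, 0644);
    struct flock lk = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                        .l_start = 0, .l_len = 0 };
    off_t cur = 0, next;

    fcntl(fd, F_SETLKW, &lk);            /* block until the lock is held */
    pread(fd, &cur, sizeof(cur), 0);     /* read current pointer value   */
    next = cur + nbytes;
    pwrite(fd, &next, sizeof(next), 0);  /* store the advanced pointer   */
    lk.l_type = F_UNLCK;
    fcntl(fd, F_SETLK, &lk);             /* release the lock             */
    close(fd);
    return cur;
}
```

Every write through the shared pointer thus pays for an open, a blocking lock acquisition, a read, a write, and an unlock on a second file, which is the overhead that motivates this work.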
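For reference, the MPI shared file pointer interface targeted by this work is sketched below: MPI_File_write_shared is the independent form, in which data is appended in arrival order, and MPI_File_write_ordered is the collective form, in which data lands in rank order. Error handling is omitted for brevity.

```c
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    char record[64];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    snprintf(record, sizeof(record), "trace record from rank %d\n", rank);

    MPI_File_open(MPI_COMM_WORLD, "trace.log",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Independent shared-pointer write: records appear in arrival order. */
    MPI_File_write_shared(fh, record, (int)strlen(record), MPI_CHAR,
                          MPI_STATUS_IGNORE);

    /* Collective (ordered) shared-pointer write: records appear in rank order. */
    MPI_File_write_ordered(fh, record, (int)strlen(record), MPI_CHAR,
                           MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```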
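The following minimal sketch illustrates why the forwarding layer is a natural home for this capability: because every client request passes through an I/O node, a per-file counter updated with an atomic fetch-and-add is sufficient to hand out non-overlapping offsets without file locks. The structure and function names here are hypothetical and are not part of the IOFSL or ZOIDFS APIs.

```c
/* Hypothetical per-file shared pointer state on an I/O forwarding server.
 * Names are illustrative assumptions, not the IOFSL or ZOIDFS interface. */
#include <stdatomic.h>
#include <stdint.h>

struct fwd_shared_fp {
    uint64_t file_handle;      /* identifies the open file            */
    _Atomic uint64_t offset;   /* current shared file pointer value   */
};

/* Reserve 'len' bytes at the shared pointer and return the offset the
 * client should write to; concurrent requests receive disjoint regions. */
static uint64_t fwd_reserve_region(struct fwd_shared_fp *fp, uint64_t len)
{
    return atomic_fetch_add_explicit(&fp->offset, len, memory_order_relaxed);
}
```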