2012
DOI: 10.1145/2331130.2331133
A Runtime System for Programming Out-of-Core Matrix Algorithms-by-Tiles on Multithreaded Architectures

Abstract: Restricted access item. This resource is not available in open access due to publisher policy; however, the full text can be accessed from Universitat Jaume I or with a subscription.

Cited by 19 publications (21 citation statements)
References 21 publications
“…For simplicity, we will only consider GPU routines that operate with data residing in the main memory. For matrix decompositions such as the QR factorization and other similar Level-3 BLAS-based kernels, disk latency can be mostly hidden by overlapping it with computation, even in platforms equipped with GPU accelerators [14]. Therefore, we expect these results to carry over to the case where data is stored on disk.…”
Section: Results
confidence: 99%
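The overlap the citing authors describe — hiding disk latency behind Level-3 BLAS-like computation — amounts to double buffering: while one tile is being processed, the next is already being read. A minimal sketch (the `read_tile` and `compute` callables are hypothetical stand-ins, not the paper's API):

```python
import threading
import queue

def process_tiles(read_tile, compute, n_tiles):
    """Overlap tile reads with computation via a one-slot prefetch queue.

    read_tile(i) loads tile i (standing in for disk I/O); compute(tile)
    stands in for the Level-3 BLAS-like kernel applied to each tile.
    """
    prefetched = queue.Queue(maxsize=1)  # at most one tile in flight

    def reader():
        for i in range(n_tiles):
            prefetched.put(read_tile(i))  # runs while the main thread computes
        prefetched.put(None)              # sentinel: no more tiles

    threading.Thread(target=reader, daemon=True).start()

    results = []
    while (tile := prefetched.get()) is not None:
        results.append(compute(tile))     # the read of the next tile overlaps this
    return results
```

When `compute` is expensive relative to `read_tile`, the read of tile *i+1* completes entirely in the shadow of the computation on tile *i*, which is the effect the quoted statements rely on.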
“…For that purpose, we consider a scalable biological case, MT (see Table I), generated using IMOD, and tune it to generate problems that do not fit into the GPU but are "small" enough for the main memory. For matrix decompositions such as the QR factorization and other similar Level-3 BLAS-based kernels, like those appearing in the compute-bounded eigensolvers, disk latency can be mostly hidden by overlapping it with computation, even in platforms equipped with GPU accelerators [19]. Therefore, we expect these results to carry over to an execution where the problem data matrices are stored on disk, and have to be transferred between secondary storage and the GPU memory.…”
Section: Performance of Compute-Bounded Solvers
confidence: 99%
“…Adding one more layer in the memory hierarchy of these eigensolvers, so that the data reside on disk instead of main memory and are moved back and forth between there and the GPU, is conceptually equivalent. Furthermore, for compute-bound operations like those present in these two algorithms, in [16], we showed that the disk latency can be perfectly hidden via a careful organization of the data movements, analogous to that performed between the GPU and the main memory. Therefore, we can reasonably expect that the OOC-GPU eigensolvers maintain their performance when operating with data that effectively resides on disk.…”
confidence: 93%
“…In a number of papers, we introduced the idea of having a sequential library, libflame, transparently mapping operations to multithreaded architectures (symmetric multiprocessing and/or multicore) [2] and/or multiaccelerator architectures (multi-GPU) [17] and even solving problems with data stored in disks [18]. The idea is to view blocks in the algorithm-by-blocks as a unit of data and operations with those blocks as units of computation (tasks).…”
Section: Separation of Concerns: A Runtime-Based Approach to Parallelism
confidence: 99%
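The algorithm-by-blocks idea quoted above — tiles as the unit of data, per-tile operations as the unit of computation (tasks) — can be illustrated with a toy sketch. The task representation and the sequential executor below are illustrative only, not the libflame/SuperMatrix API:

```python
# Hypothetical sketch: tiles are the unit of data, and each operation on a
# tile becomes a task. A real runtime system tracks inter-task dependencies
# and dispatches independent tasks to threads (or schedules tile transfers
# from disk); a trivial in-order executor stands in for it here.

def tiled_gemm_tasks(A, B, C, nt):
    """Build tasks for C += A @ B on an nt x nt grid of scalar 'tiles'."""
    tasks = []
    for i in range(nt):
        for j in range(nt):
            for k in range(nt):
                # Each task touches only whole tiles: C[i,j] += A[i,k] * B[k,j].
                tasks.append((("C", i, j),  # output tile: the dependency key
                              lambda i=i, j=j, k=k:
                                  C.__setitem__((i, j), C[i, j] + A[i, k] * B[k, j])))
    return tasks

def run(tasks):
    # Stand-in scheduler: execute in submission order. A runtime would
    # instead analyze the dependency keys and launch independent tasks
    # concurrently, or overlap them with tile I/O.
    for _, task in tasks:
        task()
```

Because every task names the tiles it reads and writes, the runtime can derive the task graph automatically — the "separation of concerns" the citing paper refers to: the library stays sequential while the runtime extracts the parallelism.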