2012
DOI: 10.1145/2331130.2331133
A Runtime System for Programming Out-of-Core Matrix Algorithms-by-Tiles on Multithreaded Architectures

Abstract: Restricted access item. This resource is not available in open access due to publisher policy; however, the full text can be accessed from Universitat Jaume I or with a subscription.

Cited by 19 publications (21 citation statements)
References 21 publications
“…For simplicity, we will only consider GPU routines that operate with data residing in the main memory. For matrix decompositions such as the QR factorization and other similar Level-3 BLAS-based kernels, disk latency can be mostly hidden by overlapping it with computation, even in platforms equipped with GPU accelerators [14]. Therefore, we expect these results to carry over to the case where data is stored on disk.…”
Section: Results
confidence: 99%
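The overlap the citing authors describe — hiding disk latency behind Level-3 BLAS-like computation — amounts to double buffering: while one tile is being processed, the next is already being read. A minimal sketch (the `read_tile` and `compute` callables are hypothetical stand-ins, not the paper's API):

```python
import threading
import queue

def process_tiles(read_tile, compute, n_tiles):
    """Overlap tile reads with computation via a one-slot prefetch queue.

    read_tile(i) loads tile i (standing in for disk I/O); compute(tile)
    stands in for the Level-3 BLAS-like kernel applied to each tile.
    """
    prefetched = queue.Queue(maxsize=1)  # at most one tile in flight

    def reader():
        for i in range(n_tiles):
            prefetched.put(read_tile(i))  # runs while the main thread computes
        prefetched.put(None)              # sentinel: no more tiles

    threading.Thread(target=reader, daemon=True).start()

    results = []
    while (tile := prefetched.get()) is not None:
        results.append(compute(tile))     # the read of the next tile overlaps this
    return results
```

When `compute` is expensive relative to `read_tile`, the read of tile *i+1* completes entirely in the shadow of the computation on tile *i*, which is the effect the quoted statements rely on.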
“…For that purpose, we consider a scalable biological case, MT (see Table I), generated using IMOD, and tune it to generate problems that do not fit into the GPU but are "small" enough for the main memory. For matrix decompositions such as the QR factorization and other similar Level-3 BLAS-based kernels, like those appearing in the compute-bounded eigensolvers, disk latency can be mostly hidden by overlapping it with computation, even in platforms equipped with GPU accelerators [19]. Therefore, we expect these results to carry over to an execution where the problem data matrices are stored on disk, and have to be transferred between secondary storage and the GPU memory.…”
Section: Performance of Compute-Bounded Solvers
confidence: 99%
“…Adding one more layer in the memory hierarchy of these eigensolvers, so that the data reside on disk instead of main memory and are moved back and forth between there and the GPU, is conceptually equivalent. Furthermore, for compute-bound operations like those present in these two algorithms, in [16], we showed that the disk latency can be perfectly hidden via a careful organization of the data movements, analogous to that performed between the GPU and the main memory. Therefore, we can reasonably expect that the OOC-GPU eigensolvers maintain their performance when operating with data that effectively resides on disk.…”
confidence: 93%
“…In a number of papers, we introduced the idea of having a sequential library, libflame, transparently mapping operations to multithreaded architectures (symmetric multiprocessing and/or multicore) [2] and/or multiaccelerator architectures (multi-GPU) [17] and even solving problems with data stored in disks [18]. The idea is to view blocks in the algorithm-by-blocks as a unit of data and operations with those blocks as units of computation (tasks).…”
Section: Separation of Concerns: A Runtime-Based Approach to Parallelism
confidence: 99%
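The algorithm-by-blocks idea quoted above — tiles as the unit of data, per-tile operations as the unit of computation (tasks) — can be illustrated with a toy sketch. The task representation and the sequential executor below are illustrative only, not the libflame/SuperMatrix API:

```python
# Hypothetical sketch: tiles are the unit of data, and each operation on a
# tile becomes a task. A real runtime system tracks inter-task dependencies
# and dispatches independent tasks to threads (or schedules tile transfers
# from disk); a trivial in-order executor stands in for it here.

def tiled_gemm_tasks(A, B, C, nt):
    """Build tasks for C += A @ B on an nt x nt grid of scalar 'tiles'."""
    tasks = []
    for i in range(nt):
        for j in range(nt):
            for k in range(nt):
                # Each task touches only whole tiles: C[i,j] += A[i,k] * B[k,j].
                tasks.append((("C", i, j),  # output tile: the dependency key
                              lambda i=i, j=j, k=k:
                                  C.__setitem__((i, j), C[i, j] + A[i, k] * B[k, j])))
    return tasks

def run(tasks):
    # Stand-in scheduler: execute in submission order. A runtime would
    # instead analyze the dependency keys and launch independent tasks
    # concurrently, or overlap them with tile I/O.
    for _, task in tasks:
        task()
```

Because every task names the tiles it reads and writes, the runtime can derive the task graph automatically — the "separation of concerns" the citing paper refers to: the library stays sequential while the runtime extracts the parallelism.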