2016
DOI: 10.1007/s10586-016-0611-8
Architecture-aware configuration and scheduling of matrix multiplication on asymmetric multicore processors

Abstract: Asymmetric multicore processors (AMPs) have recently emerged as an appealing technology for severely energy-constrained environments, especially in mobile appliances where heterogeneity in applications is mainstream. In addition, given the growing interest in low-power high performance computing, this type of architecture is also being investigated as a means to improve the throughput-per-Watt of complex scientific applications. In this paper, we design and embed several architecture-aware optimizations into …

Cited by 17 publications (25 citation statements)
References 27 publications
“…• In conclusion, compared with previous work [13,15], this paper demonstrates that, for the particular domain of DLA, it is possible to hide the difficulties intrinsic to dealing with an asymmetric architecture (e.g., workload balancing for performance, energy-aware mapping of tasks to cores, and criticality-aware scheduling) inside an asymmetry-aware implementation of the BLAS-3. As a consequence, our solution can refactor any conventional (asymmetry-agnostic) scheduler to exploit the task parallelism present in complex DLA operations.…”
Section: Introduction (mentioning)
confidence: 67%
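The idea quoted above, absorbing the asymmetry inside the BLAS-3 so that a conventional, asymmetry-agnostic runtime or scheduler can be reused unchanged, can be illustrated with a minimal sketch. Everything below is hypothetical: the routine names, the fixed big-to-LITTLE throughput ratio, and the reference triple loop that stands in for per-cluster thread teams are placeholders, not the optimized BLIS-based implementation described in the paper.

/*
 * Minimal sketch (hypothetical names): the caller issues an ordinary,
 * asymmetry-agnostic gemm call, and the asymmetry is absorbed inside the
 * routine by splitting the n-dimension of C between a big-core partition and
 * a LITTLE-core partition in proportion to an assumed relative throughput.
 * Column-major storage; C(m x n) += A(m x k) * B(k x n).
 */
#include <stddef.h>

#define RATIO_BIG 0.75   /* assumed fraction of the work given to the big cluster */

/* reference kernel standing in for the per-cluster thread teams */
static void gemm_block(size_t m, size_t n, size_t k,
                       const double *A, size_t lda,
                       const double *B, size_t ldb,
                       double *C, size_t ldc)
{
    for (size_t j = 0; j < n; ++j)
        for (size_t p = 0; p < k; ++p)
            for (size_t i = 0; i < m; ++i)
                C[i + j * ldc] += A[i + p * lda] * B[p + j * ldb];
}

/* asymmetry-agnostic interface; the big/LITTLE split is hidden inside */
void gemm_amp(size_t m, size_t n, size_t k,
              const double *A, size_t lda,
              const double *B, size_t ldb,
              double *C, size_t ldc)
{
    size_t n_big = (size_t)(n * RATIO_BIG);   /* columns assigned to the big cluster    */
    size_t n_lit = n - n_big;                 /* columns assigned to the LITTLE cluster */

    /* In a real implementation each half would run on a thread team pinned to
       its cluster; here the two sequential calls stand in for those teams. */
    gemm_block(m, n_big, k, A, lda, B, ldb, C, ldc);
    gemm_block(m, n_lit, k, A, lda, B + n_big * ldb, ldb,
               C + n_big * ldc, ldc);
}

Because the split is performed inside the routine, the code that calls gemm_amp (for example, a task-parallel scheduler for a DLA factorization) remains identical to what it would be on a symmetric multicore, which is precisely the point made in the citation.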
“…These studies offered a few relevant insights that guided the parallelization of gemm (and also other Level-3 BLAS) on the ARM big.LITTLE architecture under the GTS software execution model. Concretely, the architecture-aware multi-threaded parallelization of gemm in [13] integrates the following three techniques:…”
Section: Data-parallel Libraries for Asymmetric Architectures (mentioning)
confidence: 99%
“…The approach parallelizes the nested five-loop organization of gemm at one or more levels (i.e., loops), taking into account the cache organization of the target platform, the granularity of the computations, and the risk of race conditions, among other factors. For the multicore processors targeted in this work, an efficient choice is to extract the parallelism from Loop 4 only [26] via, e.g., OpenMP.…”
Section: Matrix Multiplication (mentioning)
confidence: 99%
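The quoted passage refers to the standard five-loop (GotoBLAS/BLIS-style) organization of gemm and the choice of multi-threading Loop 4 only. The following C/OpenMP sketch makes that structure concrete under some assumptions: it adopts the usual numbering in which Loop 1 is the outermost loop over column panels of C and Loop 4 traverses nr-wide micro-panels; the blocking parameters are placeholder values; and the packing of A and B into contiguous buffers, which a real implementation performs inside Loops 2 and 3, is omitted, so a plain triple loop stands in for the micro-kernel.

/*
 * Sketch of the five-loop gemm organization, with OpenMP parallelism
 * extracted from Loop 4 only, as mentioned in the quotation above.
 * Column-major storage; C(m x n) += A(m x k) * B(k x n).
 */
#include <stddef.h>

enum { NC = 1024, KC = 256, MC = 128, NR = 4, MR = 4 };  /* placeholder blocking */

#define MIN(a, b) ((a) < (b) ? (a) : (b))

void gemm_5loops(size_t m, size_t n, size_t k,
                 const double *A, size_t lda,
                 const double *B, size_t ldb,
                 double *C, size_t ldc)
{
    for (size_t jc = 0; jc < n; jc += NC) {            /* Loop 1: column panels of C/B */
        size_t nc = MIN(NC, n - jc);
        for (size_t pc = 0; pc < k; pc += KC) {        /* Loop 2: pack B panel here    */
            size_t kc = MIN(KC, k - pc);
            for (size_t ic = 0; ic < m; ic += MC) {    /* Loop 3: pack A block here    */
                size_t mc = MIN(MC, m - ic);
                /* Loop 4: iterations touch disjoint column blocks of C, so this
                   is the level chosen for multi-threading in the quotation. */
                #pragma omp parallel for schedule(dynamic)
                for (size_t jr = 0; jr < nc; jr += NR) {
                    size_t nr = MIN(NR, nc - jr);
                    for (size_t ir = 0; ir < mc; ir += MR) {   /* Loop 5 */
                        size_t mr = MIN(MR, mc - ir);
                        /* stand-in for the mr x nr micro-kernel: kc rank-1 updates */
                        for (size_t p = 0; p < kc; ++p)
                            for (size_t j = 0; j < nr; ++j)
                                for (size_t i = 0; i < mr; ++i)
                                    C[(ic + ir + i) + (jc + jr + j) * ldc] +=
                                        A[(ic + ir + i) + (pc + p) * lda] *
                                        B[(pc + p) + (jc + jr + j) * ldb];
                    }
                }
            }
        }
    }
}

Because each Loop 4 iteration updates a disjoint nr-wide column block of the current mc x nc block of C, the parallel region introduces no race conditions, which is consistent with the considerations (cache organization, granularity, races) listed in the quotation.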