Rafael Rodríguez‐Sánchez scite author profile

Rafael Rodríguez‐Sánchez

5 Publications

85 Citation Statements Received

81 Citation Statements Given


Affiliations

Universidad Complutense de Madrid, Jaume I University, University of Castilla-La Mancha

Publications


A Case for Malleable Thread-Level Linear Algebra Libraries: The LU Factorization With Partial Pivoting

et al. 2019


We propose two novel techniques for overcoming the load imbalance encountered when implementing so-called look-ahead mechanisms in relevant dense matrix factorizations for the solution of linear systems. Both techniques target the scenario where two thread teams are created/activated during the factorization, with each team in charge of performing an independent task/branch of execution. The first technique promotes worker sharing (WS) between the two tasks, allowing the threads of the task that completes first to be reallocated for use by the costlier task. The second technique allows a fast task to alert the slower task of completion, enforcing the early termination (ET) of the second task, and a smooth transition of the factorization procedure into the next iteration. The two mechanisms are instantiated via a new malleable thread-level implementation of the Basic Linear Algebra Subprograms (BLAS), and their benefits are illustrated via an implementation of the LU factorization with partial pivoting enhanced with look-ahead. Concretely, our experimental results on a six-core Intel Xeon processor show the benefits of combining WS+ET, reporting competitive performance in comparison with a task-parallel runtime-based solution.


Architecture-aware configuration and scheduling of matrix multiplication on asymmetric multicore processors

et al. 2016


Asymmetric multicore processors (AMPs) have recently emerged as an appealing technology for severely energy-constrained environments, especially in mobile appliances where heterogeneity in applications is mainstream. In addition, given the growing interest in low-power high performance computing, this type of architecture is also being investigated as a means to improve the throughput-per-Watt of complex scientific applications. In this paper, we design and embed several architecture-aware optimizations into a multi-threaded general matrix multiplication (gemm), a key operation of the BLAS, in order to obtain a high performance implementation for ARM big.LITTLE AMPs. Our solution is based on the reference implementation of gemm in the BLIS library, and integrates a cache-aware configuration as well as asymmetric-static and dynamic scheduling strategies that carefully tune and distribute the operation's micro-kernels among the big and LITTLE cores of the target processor. The experimental results on a Samsung Exynos 5422, a system-on-chip with ARM Cortex-A15 and Cortex-A7 clusters that implements the big.LITTLE model, show that our cache-aware versions of gemm with asymmetric scheduling attain important gains in performance with respect to their architecture-oblivious counterparts while exploiting all the resources of the AMP to deliver considerable energy efficiency.

Email addresses: catalans@uji.es (Sandra Catalán), figual@ucm.es (Francisco D. Igual), mayo@uji.es (Rafael Mayo), rarodrig@uji.es (Rafael Rodríguez-Sánchez), quintana@uji.es (Enrique S. Quintana-Ortí)


Reduction to Tridiagonal Form for Symmetric Eigenproblems on Asymmetric Multicore Processors

Alonso, Catalán, Herrero, et al. 2017


Multi-threaded dense linear algebra libraries for low-power asymmetric multicore processors

Catalán, Herrero, Igual, et al. 2018

Journal of Computational Science


Dense linear algebra libraries, such as BLAS and LAPACK, provide a relevant collection of numerical tools for many scientific and engineering applications. While there exist high performance implementations of the BLAS (and LAPACK) functionality for many current multi-threaded architectures, the adaptation of these libraries for asymmetric multicore processors (AMPs) is still pending. In this paper we address this challenge by developing an asymmetry-aware implementation of the BLAS, based on the BLIS framework, and tailored for AMPs equipped with two types of cores: fast/power hungry versus slow/energy efficient. For this purpose, we integrate coarse-grain and fine-grain parallelization strategies into the library routines which, respectively, dynamically distribute the workload between the two core types and statically repartition this work among the cores of the same type. Our results on an ARM® big.LITTLE™ processor embedded in the Exynos 5422 SoC, using the asymmetry-aware version of the BLAS and a plain migration of the legacy version of LAPACK, experimentally assess the benefits, limitations, and potential of this approach.


Static Versus Dynamic Task Scheduling of the LU Factorization on ARM big.LITTLE Architectures

Catalán, Rodríguez‐Sánchez, Quintana‐Ortí, et al. 2017


