We discuss the design and high-performance implementation of collective communication operations on distributed-memory computer architectures. Using a combination of known techniques (many of which were first proposed in the 1980s and early 1990s) along with careful exploitation of communication modes supported by MPI, we have developed implementations that have improved performance in most situations compared to those currently supported by public domain implementations of MPI such as MPICH. Performance results from a large Intel Xeon/Pentium 4 processor cluster are included.
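To make concrete the kind of classic technique the abstract alludes to, the following is a minimal sketch of a minimum-spanning-tree (binomial-tree) broadcast built from MPI point-to-point calls. This is not code from the paper or from MPICH; the function name mst_bcast and the choice of blocking MPI_Send/MPI_Recv are illustrative assumptions.

```c
#include <mpi.h>

/* Broadcast `count` doubles from rank `root` to all ranks in `comm`
 * using a binomial tree: a rank receives the data once, then forwards
 * it to ranks progressively closer in the (root-relative) order. */
void mst_bcast(double *buf, int count, int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int rel  = (rank - root + size) % size;  /* rank relative to root */
    int mask = 1;

    /* Find the round in which this rank receives the data (root skips this). */
    while (mask < size) {
        if (rel & mask) {
            int src = (rank - mask + size) % size;
            MPI_Recv(buf, count, MPI_DOUBLE, src, 0, comm, MPI_STATUS_IGNORE);
            break;
        }
        mask <<= 1;
    }

    /* Forward the data to the ranks further down the tree. */
    mask >>= 1;
    while (mask > 0) {
        if (rel + mask < size) {
            int dst = (rank + mask) % size;
            MPI_Send(buf, count, MPI_DOUBLE, dst, 0, comm);
        }
        mask >>= 1;
    }
}
```

In practice such spanning-tree building blocks are one ingredient among several; the paper combines known algorithms of this kind with careful use of MPI's communication modes to cover both short- and long-vector regimes.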
With the emergence of thread-level parallelism as the primary means for continued performance improvement, the programmability issue has reemerged as an obstacle to the use of architectural advances. We argue that evolving legacy libraries for dense and banded linear algebra is not a viable solution due to constraints imposed by early design decisions. We propose a philosophy of abstraction and separation of concerns that provides a promising solution in this problem domain. The first abstraction, FLASH, allows algorithms to express computation with matrices consisting of contiguous blocks, facilitating algorithms-by-blocks. Operand descriptions are registered for a particular operation a priori by the library implementor. A runtime system, SuperMatrix, uses this information to identify data dependencies between suboperations, allowing them to be scheduled to threads out of order and executed in parallel. However, not all classical algorithms in linear algebra lend themselves to conversion to algorithms-by-blocks. We show how our recently proposed LU factorization with incremental pivoting and a closely related algorithm-by-blocks for the QR factorization, both originally designed for out-of-core computation, overcome this difficulty. Anecdotal evidence from the development of routines with core functionality demonstrates how the methodology supports high productivity, while experimental results suggest that high performance is abundantly achievable.
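The storage-by-blocks idea behind FLASH can be sketched as follows. This is a minimal illustration under assumed names (BlockedMatrix and gemm_by_blocks are not FLASH's actual API): the matrix is an array of pointers to contiguous b-by-b blocks, and the algorithm loops over whole blocks rather than over scalar elements.

```c
typedef struct {
    int      nb;   /* number of blocks in each dimension            */
    int      b;    /* block size: each block is b*b and contiguous  */
    double **blk;  /* nb*nb block pointers, row-major: blk[i*nb+j]  */
} BlockedMatrix;

/* One suboperation: C := C + A*B on a single b-by-b block triple. */
static void gemm_block(int b, const double *A, const double *B, double *C)
{
    for (int i = 0; i < b; i++)
        for (int p = 0; p < b; p++)
            for (int j = 0; j < b; j++)
                C[i*b + j] += A[i*b + p] * B[p*b + j];
}

/* Algorithm-by-blocks for C := C + A*B. */
void gemm_by_blocks(const BlockedMatrix *A, const BlockedMatrix *B,
                    BlockedMatrix *C)
{
    int nb = A->nb, b = A->b;
    for (int i = 0; i < nb; i++)
        for (int j = 0; j < nb; j++)
            for (int p = 0; p < nb; p++)
                /* Reads A(i,p) and B(p,j); writes C(i,j). These
                 * registered read/write operands are what a runtime
                 * like SuperMatrix inspects to build the dependency
                 * graph and dispatch suboperations out of order. */
                gemm_block(b, A->blk[i*nb + p], B->blk[p*nb + j],
                           C->blk[i*nb + j]);
}
```

Each gemm_block invocation is exactly the kind of suboperation the abstract describes: because tasks with distinct output blocks are independent, a runtime is free to schedule them to threads in any order consistent with the data dependencies.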
We discuss the high-performance parallel implementation and execution of dense linear algebra matrix operations on SMP architectures, with an eye towards multi-core processors with many cores. We argue that traditional implementations, such as those incorporated in LAPACK, cannot be easily modified to render high performance and scalability on these architectures. The solution we propose is to arrange the data structures and algorithms so that matrix blocks become the fundamental units of data and operations on these blocks become the fundamental units of computation, resulting in algorithms-by-blocks as opposed to the more traditional blocked algorithms. We show that this approach facilitates the adoption of techniques akin to the dynamic scheduling and out-of-order execution common in superscalar processors, which we name SuperMatrix Out-of-Order scheduling. Performance results on a 16-CPU Itanium2-based server are used to highlight opportunities and issues related to this new approach.
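SuperMatrix implements its own runtime, but the scheduling idea can be sketched, purely as an analogy, with standard OpenMP task dependences: each block operation of a Cholesky algorithm-by-blocks is submitted as a task annotated with the blocks it reads (in) and updates (inout), and the tasking runtime dispatches tasks out of order as their operands become ready. The storage-by-blocks layout and the routine name below are assumptions; dpotrf_, dtrsm_, dsyrk_, and dgemm_ are the standard LAPACK/BLAS entry points.

```c
/* Fortran BLAS/LAPACK entry points (standard symbols). */
void dpotrf_(const char*, const int*, double*, const int*, int*);
void dtrsm_(const char*, const char*, const char*, const char*,
            const int*, const int*, const double*, const double*, const int*,
            double*, const int*);
void dsyrk_(const char*, const char*, const int*, const int*,
            const double*, const double*, const int*,
            const double*, double*, const int*);
void dgemm_(const char*, const char*, const int*, const int*, const int*,
            const double*, const double*, const int*, const double*, const int*,
            const double*, double*, const int*);

/* Cholesky algorithm-by-blocks (lower triangular, A = L*L^T).
 * A[i*nb+j] points to the contiguous, column-major b-by-b block (i,j). */
void chol_by_blocks(double **A, int nb, int b)
{
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < nb; k++) {
        #pragma omp task depend(inout: A[k*nb+k][0:b*b])
        { int info; dpotrf_("L", &b, A[k*nb+k], &b, &info); }

        for (int i = k + 1; i < nb; i++) {
            /* A(i,k) := A(i,k) * inv(L(k,k))^T */
            #pragma omp task depend(in: A[k*nb+k][0:b*b]) \
                             depend(inout: A[i*nb+k][0:b*b])
            { double one = 1.0;
              dtrsm_("R", "L", "T", "N", &b, &b, &one,
                     A[k*nb+k], &b, A[i*nb+k], &b); }
        }

        for (int i = k + 1; i < nb; i++) {
            /* A(i,i) := A(i,i) - A(i,k) * A(i,k)^T */
            #pragma omp task depend(in: A[i*nb+k][0:b*b]) \
                             depend(inout: A[i*nb+i][0:b*b])
            { double mone = -1.0, one = 1.0;
              dsyrk_("L", "N", &b, &b, &mone, A[i*nb+k], &b,
                     &one, A[i*nb+i], &b); }

            for (int j = k + 1; j < i; j++) {
                /* A(i,j) := A(i,j) - A(i,k) * A(j,k)^T */
                #pragma omp task depend(in: A[i*nb+k][0:b*b], \
                                            A[j*nb+k][0:b*b]) \
                                 depend(inout: A[i*nb+j][0:b*b])
                { double mone = -1.0, one = 1.0;
                  dgemm_("N", "T", &b, &b, &b, &mone, A[i*nb+k], &b,
                         A[j*nb+k], &b, &one, A[i*nb+j], &b); }
            }
        }
    }
}
```

The loops enqueue tasks in program order, but nothing forces execution in that order: as in a superscalar processor, a task issues as soon as the blocks it depends on have been produced, so independent updates from different iterations of k proceed concurrently.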