In this paper, we consider the effect that the data-storage scheme and pivoting scheme have on the efficiency of LU factorization on a distributed-memory multiprocessor. Our presentation will focus on the hypercube architecture, but most of our results are applicable to distributed-memory architectures in general. We restrict our attention to two commonly used storage schemes (storage by rows and by columns) and investigate partial pivoting both by rows and by columns, yielding four factorization algorithms. Our goal is to determine which of these four algorithms admits the most efficient parallel implementation. We analyze factors such as load distribution, pivoting cost, and potential for pipelining. We conclude that, in the absence of loop-unrolling, LU factorization with partial pivoting is most efficient when pipelining is used to mask the cost of pivoting. The two schemes that can be pipelined are pivoting by interchanging rows when the coefficient matrix is distributed to the processors by columns, and pivoting by interchanging columns when the matrix is distributed to the processors by rows.

Key words. parallel algorithms, distributed-memory multiprocessors, LU factorization, Gaussian elimination, hypercube

AMS(MOS) subject classifications. 65F, 65W

1. Introduction. This paper describes four approaches for implementing LU factorization on a distributed-memory multiprocessor, specifically a hypercube. Our goal is to determine whether the choice of storage scheme for the coefficient matrix and pivoting strategy appreciably affects the efficiency of parallel factorization and, if so, which of the four algorithms is to be preferred. The empirical results presented in the sequel were obtained by implementing the factorization algorithms on an Intel iPSC hypercube. A number of papers have appeared in recent years describing various approaches to parallelizing LU factorization, including Davis [4], Chamberlain [2], and Geist [7].
The present work is motivated primarily by Geist and Heath [8] and Chu and George [3]. In most of these earlier papers, row storage for the coefficient matrix was chosen principally because no efficient parallel algorithms were then known to exist for the subsequent triangular solutions if the coefficient matrix was stored by columns. Recently, Romine and Ortega [16], Romine [15], Li and Coleman [11], [12], and Heath and Romine [10] have demonstrated such algorithms, removing triangular solutions as a reason for preferring row storage. In addition, if the coefficient matrix is stored by rows, then pivoting by interchanging rows involves extra communication, since the elements which must be searched are scattered among the processors. With column storage, no additional communication is required. Hence, column storage for the coefficient matrix warrants further investigation. One alternative method that has been suggested for the solution of linear systems on distributed-memory multiprocessors is QR factorization (see Ortega and Voigt [14]). QR factorization is inherently stable and thus a...
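The communication asymmetry noted above can be made concrete with a small serial sketch of the pivot search at elimination step k under the two storage schemes. This is an illustrative model only, not the paper's implementation: the function names and the wrap mapping (`owner`) are assumptions, and the "combine" step stands in for whatever fan-in the hypercube would actually perform.

```python
def owner(index, p):
    """Illustrative wrap mapping: row/column `index` lives on processor index % p."""
    return index % p

def pivot_search_column_storage(A, k, p):
    """Row pivoting with column storage: column k resides entirely on one
    processor, so the search for max |A[i][k]|, i >= k, is purely local."""
    proc = owner(k, p)  # the single processor holding column k
    pivot_row = max(range(k, len(A)), key=lambda i: abs(A[i][k]))
    return proc, pivot_row  # no interprocessor communication needed

def pivot_search_row_storage(A, k, p):
    """Row pivoting with row storage: the entries of column k are scattered
    among all p processors; each finds a local maximum, and the results must
    then be combined, which is the extra communication step."""
    n = len(A)
    local_best = {}
    for q in range(p):  # each processor searches only the rows it owns
        rows = [i for i in range(k, n) if owner(i, p) == q]
        if rows:
            local_best[q] = max(rows, key=lambda i: abs(A[i][k]))
    # Global combine (stand-in for a fan-in/reduction on the hypercube).
    return max(local_best.values(), key=lambda i: abs(A[i][k]))
```

Both routines select the same pivot row; they differ only in whether the selection is local to one processor or requires a global combine, which is exactly the distinction that motivates reconsidering column storage.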