In this paper, we consider the effect that the data-storage scheme and pivoting scheme have on the efficiency of LU factorization on a distributed-memory multiprocessor. Our presentation will focus on the hypercube architecture, but most of our results are applicable to distributed-memory architectures in general. We restrict our attention to two commonly used storage schemes (storage by rows and by columns) and investigate partial pivoting both by rows and by columns, yielding four factorization algorithms. Our goal is to determine which of these four algorithms admits the most efficient parallel implementation. We analyze factors such as load distribution, pivoting cost, and potential for pipelining. We conclude that, in the absence of loop-unrolling, LU factorization with partial pivoting is most efficient when pipelining is used to mask the cost of pivoting. The two schemes that can be pipelined are pivoting by interchanging rows when the coefficient matrix is distributed to the processors by columns, and pivoting by interchanging columns when the matrix is distributed to the processors by rows.

Key words. parallel algorithms, distributed-memory multiprocessors, LU factorization, Gaussian elimination, hypercube

AMS(MOS) subject classifications. 65F, 65W

1. Introduction. This paper describes four approaches for implementing LU factorization on a distributed-memory multiprocessor, specifically a hypercube. Our goal is to determine whether the choice of storage scheme for the coefficient matrix and pivoting strategy appreciably affects the efficiency of parallel factorization and, if so, which of the four algorithms is to be preferred. The empirical results presented in the sequel were obtained by implementing the factorization algorithms on an Intel iPSC hypercube. A number of papers have appeared in recent years describing various approaches to parallelizing LU factorization, including Davis [4], Chamberlain [2], and Geist [7].
The present work is motivated primarily by Geist and Heath [8] and Chu and George [3]. In most of these earlier papers, row storage for the coefficient matrix was chosen principally because no efficient parallel algorithms were then known to exist for the subsequent triangular solutions if the coefficient matrix was stored by columns. Recently, Romine and Ortega [16], Romine [15], Li and Coleman [11], [12], and Heath and Romine [10] have demonstrated such algorithms, removing triangular solutions as a reason for preferring row storage. In addition, if the coefficient matrix is stored by rows, then pivoting by interchanging rows involves extra communication, since the elements which must be searched are scattered among the processors. With column storage, no additional communication is required. Hence, column storage for the coefficient matrix warrants further investigation. One alternative method that has been suggested for the solution of linear systems on distributed-memory multiprocessors is QR factorization (see Ortega and Voigt [14]). QR factorization is inherently stable and thus a...
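The communication asymmetry noted above can be made concrete with a small serial sketch of the pivot search at elimination step k under the two storage schemes. This is an illustrative model only, not the paper's implementation: the function names and the wrap mapping (`owner`) are assumptions, and the "combine" step stands in for whatever fan-in the hypercube would actually perform.

```python
def owner(index, p):
    """Illustrative wrap mapping: row/column `index` lives on processor index % p."""
    return index % p

def pivot_search_column_storage(A, k, p):
    """Row pivoting with column storage: column k resides entirely on one
    processor, so the search for max |A[i][k]|, i >= k, is purely local."""
    proc = owner(k, p)  # the single processor holding column k
    pivot_row = max(range(k, len(A)), key=lambda i: abs(A[i][k]))
    return proc, pivot_row  # no interprocessor communication needed

def pivot_search_row_storage(A, k, p):
    """Row pivoting with row storage: the entries of column k are scattered
    among all p processors; each finds a local maximum, and the results must
    then be combined, which is the extra communication step."""
    n = len(A)
    local_best = {}
    for q in range(p):  # each processor searches only the rows it owns
        rows = [i for i in range(k, n) if owner(i, p) == q]
        if rows:
            local_best[q] = max(rows, key=lambda i: abs(A[i][k]))
    # Global combine (stand-in for a fan-in/reduction on the hypercube).
    return max(local_best.values(), key=lambda i: abs(A[i][k]))
```

Both routines select the same pivot row; they differ only in whether the selection is local to one processor or requires a global combine, which is exactly the distinction that motivates reconsidering column storage.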