Step 2: Perform experimental measurements on sample cases to determine the model parameters, as well as the times for computation, communication, memory access, and the miscellaneous time for auxiliary instructions.

Step 3: Select the template for regression analysis to estimate the miscellaneous overhead time. Determine the regression coefficients based on the experimentally measured values. The regression formula for miscellaneous overhead time is denoted by f_misc.

Step 4: Based on the experimental measurements, modify the analytical expressions f_mem and f_comm so that the predictions match the experimental timings determined in Step 2. The modifications to f_mem are done to take into account cache effects and the overlap of memory accesses with other operations.
The modifications to f_comm are done to take into account the overlap of communication with computation.

Step 5: Finally, the following formula is obtained to predict the execution time:

T = f_comp + f_comm + f_misc + f_mem
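As a concrete illustration of Steps 3 and 5, the sketch below fits regression coefficients for f_misc by ordinary least squares and then sums the four cost components to predict execution time. The linear regression template f_misc(n) = a*n + b, the sample timings, and all function names are hypothetical assumptions for illustration, not the model or measurements from this study.

```python
# Sketch of Steps 3 and 5. The linear template f_misc(n) = a*n + b and
# all timing numbers below are illustrative assumptions, not measured
# values from the study.

def fit_linear(xs, ys):
    """Ordinary least squares for y = a*x + b (closed-form normal equations)."""
    m = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    a = (m * sxy - sx * sy) / (m * sxx - sx * sx)
    b = (sy - a * sx) / m
    return a, b

# Step 2 (assumed): measured miscellaneous overhead for sample problem sizes.
sizes = [100, 200, 400, 800]
misc_measured = [0.5, 0.9, 1.7, 3.3]   # milliseconds, hypothetical

# Step 3: determine the regression coefficients for f_misc.
a, b = fit_linear(sizes, misc_measured)
def f_misc(n):
    return a * n + b

# Step 5: the predicted execution time is the sum of the four components.
def predict_time(n, f_comp, f_comm, f_mem):
    return f_comp(n) + f_comm(n) + f_misc(n) + f_mem(n)

# Hypothetical analytical component models (assumed already calibrated
# against experiment, as in Step 4).
t = predict_time(1600,
                 f_comp=lambda n: 2.0e-3 * n,
                 f_comm=lambda n: 0.5e-3 * n,
                 f_mem=lambda n: 0.3e-3 * n)
print(round(f_misc(1600), 3), round(t, 3))
```

The same pattern extends to the multi-term regression templates used in practice: add columns for each term (e.g. n, n/p, log p) and solve the resulting least-squares system.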
Details

The analytical formulas for the three parallel algorithms are given in Appendix A. In analyzing practical scenarios for parallel machines, the lower-order terms can be significant. These formulas are carefully derived by examining each parallel algorithm to capture all its essential details. The formulas are complex, but the advantage is that the performance predictions are very accurate.

The three algorithms used in the study are well known. LU decomposition is described in [18]. The details of the FFT algorithm can be found in [13]. Cannon's parallel algorithm is described in [41]. The LU decomposition uses a 2-D scattered data layout for the coefficient matrix (see Section 2.3.5.1), and it includes partial pivoting. Different communication patterns are used by the three algorithms. The matrix multiplication uses nearest-neighbor communication, where elements are shifted from one processor to the next along either a row or a column, with wrap-around at the end. In the case of the LU decomposition, communication is need...