Multi-GPU Implementation of LU Factorization

Jia, Yulu; Łuszczek, Piotr; Dongarra, Jack

doi:10.1016/j.procs.2012.04.012

Cited by 12 publications

(6 citation statements)

References 18 publications


“…Algorithm 2 shows how the decision about compressing each message sent is calculated. This algorithm must be included in the application to be executed (see 4 in Figure 1). Initially, values stored in Compression Heuristics File are loaded in internal tables.…”

Section: Basic Architecture of the Proposed Framework (mentioning)

confidence: 99%


SANComSim: A Scalable, Adaptive and Non-intrusive Framework to Optimize Performance in Computational Science Applications

Núñez

Filgueira

Merayo

2013

Procedia Computer Science


Parallel processing has become the most common solution for developing and executing scientific computing applications. The best way to obtain good performance ratios is to exploit parallelism in both processing and communications. Although the study of computational performance has historically focused on CPU power, the CPU is no longer the only concern in overall performance. Due to the underlying design of parallel applications, communication networks play a very important role in the field of computational science. Although the networks used in multicore clusters are fast and have low latency, the amount of transferred data may cause a bottleneck in the communication system, as communication-intensive parallel applications spend a significant amount of their total execution time exchanging data between processes. Moreover, in most cases, several users are executing different parallel applications at the same time in the cluster. In this paper we present SANComSim, a Scalable, Adaptive and Non-intrusive framework, based on simulation techniques, for optimizing the performance of the network system to execute complex applications. The main objective of this framework is to apply run-time compression to reduce the data sent through the network, in order to increase the overall system performance. The main features of SANComSim are: adaptability, to dynamically adapt to the current state of the system; portability, as the framework is focused on neither a specific programming language nor a specific platform; non-intrusiveness, since the framework is based on simulation techniques, which do not require exclusive access to the entire cluster system; and scalability, since any parallel application, independently of the number of processes and computing nodes, can use this framework to improve performance in cluster systems.


“…The current trend is to use multicore clusters in order to increase the computation capability, thus allowing an increase in the number of processes per application. Examples of these applications can be found in many fields of computational science like MRI scan data [1], molecular dynamics [2], simulations [3] and mathematics [4].…”

Section: Introduction (mentioning)

confidence: 99%

SANComSim: A Scalable, Adaptive and Non-intrusive Framework to Optimize Performance in Computational Science Applications

Núñez

Filgueira

Merayo

2013

Procedia Computer Science


“…Mapping the LU algorithm over the graphics processor core is not an easy task to do since this process depends on massive memory references which will not fit in the GPU's core memory due to its relatively small size introducing unnecessary delays in the operation. Another hybrid organization connecting 48 AMD CPUs and 4 Fermi GPUs was used in [5] and another by E. Agullo in [6] where Nvidia tesla GPUs and Fermi based GPUs were used to test their algorithm. [7].…”

Section: B. GPU Based Solution (mentioning)

confidence: 99%

“…In addition to scalability issues, general purpose architectures such as multicores and many cores have inefficiencies which deviate the algorithm performances largely from the peak performances of the hardware. Massively parallel GPU's [4], [5] have this same problem in a much greater amount because of their architecture. Application Specific Instruction set Processors (ASIP) are used to implement an optimized architecture able to serve a group of applications from the same domain (i.e.…”

Section: Introduction (mentioning)

confidence: 99%

NOA'S-Arc: NISC based, optimized array scalable architecture

Hassan

Farag

Hanafy

2013

2013 IEEE 56th International Midwest Symposium on Circuits and Systems (MWSCAS)


Statically scheduled scientific computing problems represent a large set of problems that require an intensive amount of computation. The common characteristics of this set of problems can be used to optimize an architecture so that utilization exceeds 90% of peak performance. The proposed architecture is an array of reconfigurable NISC (No Instruction Set Computer) processing elements (PEs) connected by a reconfigurable NoC (Network on Chip). An optimized data path for a group of problems is suggested. The control of each PE is reconfigurable and can be customized for each application, as can the NoC. The architecture is simulated using a tile of 64 PEs to run the LU decomposition algorithm on a dense matrix, and the results show a performance of 177 GFLOPS, which outperforms the NVIDIA 6800 and 7800 GPU implementations and an OpenMP multicore solution running on an Intel Core 2 Quad CPU with four processor cores.


“…We focus on the LU factorization because of the constraints related to the synchronization of the processes that are involved during the panel factorization. For this case, the load balancing problem has been well studied by several authors [20,19,15,12,18,14]. For most implementations, the main idea is to determine empirically the amount of work to assign to the different computational units, or perform some necessary adjustments depending on the problem size in order to keep CPUs busy.…”

Section: Introduction (mentioning)

confidence: 99%

Dynamically Balanced Synchronization-Avoiding LU Factorization with Multicore and GPUs

Donfack

Tomov

Dongarra

2014

2014 IEEE International Parallel & Distributed Processing Symposium Workshops

Self Cite


Graphics processing units (GPUs) brought huge performance improvements in the scientific and numerical fields. We present an efficient hybrid CPU/GPU computing approach that is portable, dynamically and efficiently balances the workload between the CPUs and the GPUs, and avoids data transfer bottlenecks that are frequently present in numerical algorithms. Our approach determines the amount of initial work to assign to the CPUs before the execution, and then dynamically balances workloads during the execution. Then, we present a theoretical model to guide the choice of the initial amount of work for the CPUs. The validation of our model allows our approach to self-adapt on any architecture using the manufacturer's characteristics of the underlying machine. We illustrate our method for the LU factorization. For this case, we show that the use of our approach combined with a communication avoiding LU algorithm is efficient. For example, our experiments on high-end hybrid CPU/GPU systems show that our dynamically balanced synchronization-avoiding LU is both multicore and GPU scalable. Comparisons with state-of-the-art libraries like MKL (for multicore) and MAGMA (for hybrid systems) are provided, demonstrating significant performance improvements. The approach is applicable to other linear algebra algorithms. The scheduling mechanisms and tuning models can be incorporated into respectively dynamic runtime systems/schedulers and autotuning frameworks for hybrid CPU/MIC/GPU architectures.
