Implementation and performance analysis of parallel conjugate gradient on the Cell Broadband Engine

Sibai, Fadi N.; Kidwai, Hashir

doi:10.1147/jrd.2010.2071191

Cited by 3 publications

(4 citation statements)

References 13 publications

“…All computational intensive elements in CG and ORTHOMIN (i.e. inner products and matrix vector operations) are trivially parallelizable [2], and as a result, researchers [2,6] have placed importance on analyzing the parallelization of sparse matrix-vector multiplication (SpMV), given its role in solving linear systems and eigenvalue problems that arise in scientific and engineering applications. The methods for efficiently manipulating sparse matrix structures are of utmost importance to the performance of many applications because they arise in numerous computational disciplines.…”

Section: Background (mentioning)

confidence: 99%
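
The statement above names inner products and matrix-vector operations as the costly parts of CG. As a minimal sketch (not the cited authors' Cell code), the following pure-Python version shows where those operations occur in each iteration. The CSR storage layout, the function names, and the small test matrix are assumptions chosen for illustration.

```python
def spmv(indptr, indices, data, x):
    """Sparse matrix-vector product y = A @ x, with A in CSR form."""
    y = [0.0] * (len(indptr) - 1)
    for i in range(len(y)):
        for k in range(indptr[i], indptr[i + 1]):
            y[i] += data[k] * x[indices[k]]
    return y


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(indptr, indices, data, b, tol=1e-10, max_iter=1000):
    """Unpreconditioned CG for a symmetric positive definite matrix."""
    x = [0.0] * len(b)
    r = list(b)                 # residual b - A x for the initial guess x = 0
    p = list(r)                 # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = spmv(indptr, indices, data, p)      # the single SpMV per iteration
        alpha = rs_old / dot(p, Ap)              # inner product 1
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)                       # inner product 2
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Hypothetical 3x3 tridiagonal test system [[2,-1,0],[-1,2,-1],[0,-1,2]] x = b
indptr = [0, 2, 5, 7]
indices = [0, 1, 0, 1, 2, 1, 2]
data = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]
b = [1.0, 0.0, 1.0]
x = conjugate_gradient(indptr, indices, data, b)   # exact solution is [1, 1, 1]
```

Each iteration performs one SpMV, two inner products, and three vector updates, which is why these kernels are the natural targets for parallelization.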

“…Unlike the parallel Conjugate Gradient algorithm [2], in ORTHOMIN, because of large storage requirements, we opted for row-wise decomposition of the matrix, where PPE acts as a master, and notifies the SPEs to perform matrix vector products and the dot products (which consumes a significant amount of time to execute). During each loop, the PPE has to notify each SPE through the signaling API (see Figure 2), to fetch the corresponding input vector based on the signal type (32-bit data), perform the partial matrix vector and dot product operations and return the partial output vector to the main memory through DMA calls and notify PPE that the assigned task has been finished.…”

Section: Parallel Implementation (mentioning)

confidence: 99%
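
The row-wise decomposition described in the statement above can be sketched as follows. This is a sequential simulation in plain Python, not Cell SDK code: the signaling and DMA steps are replaced by ordinary function calls, dense row lists stand in for the real matrix storage, and all function names are hypothetical.

```python
def split_rows(n, workers):
    """Return contiguous (start, stop) row ranges, as even as possible."""
    base, extra = divmod(n, workers)
    ranges, start = [], 0
    for w in range(workers):
        stop = start + base + (1 if w < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def worker_task(rows, x, start, stop):
    """Work of one worker (an SPE in the paper): its slice of y = A @ x
    and its partial contribution to the dot product x . y."""
    y_part = [sum(a * xj for a, xj in zip(rows[i], x)) for i in range(start, stop)]
    dot_part = sum(x[start + k] * y_part[k] for k in range(len(y_part)))
    return y_part, dot_part


def master_matvec_dot(rows, x, workers=8):
    """Work of the master (the PPE in the paper): hand out row blocks,
    then gather the partial vectors and reduce the partial dot products."""
    y, total = [], 0.0
    for start, stop in split_rows(len(rows), workers):
        y_part, dot_part = worker_task(rows, x, start, stop)
        y.extend(y_part)
        total += dot_part
    return y, total


# Hypothetical 10x10 tridiagonal matrix (2 on the diagonal, -1 beside it)
n = 10
rows = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
        for i in range(n)]
x = [1.0] * n
y, xAx = master_matvec_dot(rows, x)   # y is 1 at both ends and 0 elsewhere
```

Because each worker owns a contiguous block of rows, it needs the whole input vector but writes only its own slice of the output, so the only reduction the master performs is the sum of the partial dot products.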

“…Parallelism is essential to reduce the execution time for large systems of linear equations with thousands of equations and thousands of unknown variables. In prior work [1][2], we parallelized and implemented the Conjugate Gradient solver on the STI (Sony, Toshiba, IBM) Cell Broadband Engine (Cell) platform [8][9][10]. The Cell processor is a multi-core processor with nine cores on a single chip.…”

Section: Introduction (mentioning)

confidence: 99%

Parallel Orthomin Equation Solver on the Cell Broadband Engine

Sibai

Kidwai

2011

2011 Sixth International Symposium on Parallel Computing in Electrical Engineering

This paper presents our parallelization and implementation of the ORTHOMIN solver on the Cell Broadband Engine. The solution of linear systems of equations is one of the most central processing unit-intensive steps in many engineering and simulation applications and can greatly benefit from the multitude of SIMD-capable synergistic processor element (SPE) cores in the Cell processor. We report the serial ORTHOMIN implementation on the Cell's PowerPC processor element (PPE), and the parallelization and performance analysis of ORTHOMIN across 8 SPEs for Tridiagonal (1-D reservoir grid) and Heptadiagonal (3-D reservoir grid) matrices. Our implementation is shown to scale well with data size and grid dimensionality.

Parallel linear equation solvers are used in the solution of many engineering and simulation applications. These solvers find the solution of a system of linear equations taking advantage of parallel processing methods. Parallelism is essential to reduce the execution time for large systems of linear equations with thousands of equations and thousands of unknown variables. In prior work [1-2], we parallelized and implemented the Conjugate Gradient solver on the STI (Sony, Toshiba, IBM) Cell Broadband Engine (Cell) platform [8][9][10]. The Cell processor is a multi-core processor with nine cores on a single chip. One core is a PowerPC RISC processor known as the PPE (power processing element). Eight computational cores with vector capabilities are known as the SPEs (synergistic processing elements). With its multi-core parallelism and vector and SIMD processing capabilities, the STI Cell processor was shown to deliver top computation performance levels on graphics and image processing applications [3] and video surveillance applications [4]. Because of the Cell's unique architecture, shorter simulator run times are expected on the Cell platform compared to existing systems. Shorter simulation times translate into faster solutions (in terms of days, depending on the size of the problem) and allow for more simulation runs to be performed on the same hardware resources during fixed time durations.

A key component of a simulator in engineering disciplines is a linear equation solver which computes the solution to a system of difference equations. ORTHOMIN is an example of a minimal-residual method used in engineering simulations such as oil reservoir simulation. In ORTHOMIN, a series of vectors x(1), x(2), …, x(k+1) is generated for k+1 <= n starting from an initial estimate x(0), where n is the number of linear equations or the number of unknown variables (unknowns). The linear system equation is of the form A X = B where X is the vector of unknowns, and A and B are scalar vectors. A is known as the coefficient matrix and X is known as the unknown variable matrix. The exact solution is obtained in n iterations assuming no round off errors occur, while an approximate solution is obtained in less than n iterations. The drawback of ORTHOMIN is that it requires a large memory which provides a huge bottleneck on ...
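
For readers unfamiliar with the two matrix shapes named in the abstract, the sketch below shows why a 1-D grid leads to three nonzero diagonals and a 3-D grid to seven. It assumes a nearest-neighbor stencil and a natural ordering of grid cells (x index varying fastest), neither of which the abstract states; the function name is hypothetical.

```python
def stencil_offsets(nx, ny=1, nz=1):
    """Diagonal offsets of the matrix produced by a nearest-neighbor stencil
    on an nx x ny x nz grid, assuming natural (x-fastest) cell ordering."""
    offsets = [0]                      # main diagonal: the cell itself
    if nx > 1:
        offsets += [-1, 1]             # neighbors in x
    if ny > 1:
        offsets += [-nx, nx]           # neighbors in y
    if nz > 1:
        offsets += [-nx * ny, nx * ny] # neighbors in z
    return sorted(offsets)


tri = stencil_offsets(100)        # 1-D grid: [-1, 0, 1]
hepta = stencil_offsets(4, 3, 2)  # 3-D grid: [-12, -4, -1, 0, 1, 4, 12]
```

The outermost diagonals of the 3-D case sit nx * ny entries away from the main diagonal, so a row block of the matrix touches input-vector entries that are far apart in memory.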

“…It is well known that the cost of the CG is dominated by the SpMV operation [34][35][36]. This is exemplified in Figure 1.10 for the solution of the discrete Poisson equation on a mesh over a spherical domain with 400,000 control volumes (CV) on a single CPU.…”

Section: Distribution Of Execution Time (mentioning)

confidence: 99%

Heterogeneous parallel algorithms for computational fluid dynamics on unstructured meshes

Oyarzun Altamirano

Frontiers of computational fluid dynamics (CFD) are constantly expanding and eagerly demanding more computational resources. Currently, we are experiencing a rapid evolution in high performance computing systems driven by power consumption constraints. New HPC nodes incorporate accelerators that are used as math co-processors for increasing the throughput and the FLOP per watt ratio. On the other hand, multi-core CPUs have turned into energy efficient system-on-chip architectures. By doing so, the main components of the node are fused and integrated into a single chip, reducing the energy costs. Nowadays, several institutions and governments are investing in the research and development of different aspects of HPC that could lead to the next generations of supercomputers. These initiatives have named the problem the exascale challenge. This goal can only be achieved by incorporating major changes in computer architecture, memory design and network interfaces. The CFD community faces an important challenge: keeping pace with the rapid changes in HPC resources. The codes and formulations need to be redesigned in order to exploit the different levels of parallelism and complex memory hierarchies of the new heterogeneous systems. The main characteristics demanded of the new CFD software are: memory awareness, extreme concurrency, modularity and portability. This thesis is devoted to the study of a CFD algorithm refactoring for the adoption of new technologies. Our application context is the solution of incompressible flows (DNS or LES) on unstructured meshes. The first approach was using GPUs for accelerating the Poisson solver, which is the most computationally intensive part of our application. The positive results obtained in this first step motivated us to port the complete time integration phase of our application. This requires a major redesign of the code. We propose a portable implementation model for CFD applications. 
The main idea was substituting stencil data structures and kernels with algebraic storage formats and operators. By doing so, the algorithm was restructured into a minimal set of algebraic operations. The implementation strategy consisted of the creation of a low-level algebraic layer for computations on CPUs and GPUs, and a high-level user-friendly discretization layer for CPUs that is fully localized at the preprocessing stage, where performance does not play an important role. As a result, at the time-integration phase the code relies only on three algebraic kernels: sparse-matrix-vector product (SpMV), linear combination of two vectors (AXPY) and dot product (DOT). Such a simple set of basic linear algebra operations naturally provides the desired portability to any computing architecture. Special attention was paid to the development of data structures compatible with the stream processing model. A detailed performance analysis was carried out in both sequential and parallel execution, engaging up to 128 GPUs in a hybrid CPU/GPU supercomputer. Moreover, we tested the portable implementation model of the TermoFluids code on the Mont-Blanc mobile-based supercomputer. The redesign of the kernels exploits a heterogeneous execution model using both computing devices, CPU and GPU, of the ARM-based nodes. The load balancing between the two computing devices exploits a tabu search strategy that tunes the workload distribution during the preprocessing stage. A comparison of the Mont-Blanc prototypes with high-end supercomputers in terms of the achieved net performance and energy consumption provided some guidelines on the behavior of CFD applications on ARM-based architectures. Finally, we present a memory aware auto-tuned Poisson solver for problems with one Fourier diagonalizable direction. 
This work was developed and tested on the BlueGene/Q Vesta supercomputer, and aims at demonstrating the relevance of vectorization and memory awareness for fully exploiting modern energy efficient CPUs.
