An Overview of the BlueGene/L System Software Organization

Almási, George; Bellofatto, R.; Brunheroto, José R.; Caşcaval, Călin; Castaños, José G.; Crumley, P.; Erway, C. Christopher; Lieber, Derek; Martorell, Xavier; Moreira, José E.; Sahoo, Ramendra K.; Sanomiya, A.; Ceze, Luís; Strauß, Karin

doi:10.1142/s0129626403001513

Cited by 17 publications

(16 citation statements)

References 2 publications

“…There exist many research and production-quality compilers for SIMD extensions, including Intel's C++ compiler [27], IBM's XL C compiler for BlueGene/L supercomputers [3], a vectorizing extension to the SUIF compiler [46], Codeplay's VECTOR C compiler [10], and the SWAR compiler scc [14], [15].…”

Section: B. Vectorizing Codes for Short Vector SIMD Extensions (mentioning)

confidence: 99%

Efficient Utilization of SIMD Extensions

et al. 2005

Abstract: This paper targets automatic performance tuning of numerical kernels in the presence of multi-layered memory hierarchies and SIMD parallelism. The studied SIMD instruction set extensions include Intel's SSE family, AMD's 3DNow!, Motorola's AltiVec, and IBM's BlueGene/L SIMD instructions. FFTW, ATLAS, and SPIRAL demonstrate that near-optimal performance of numerical kernels across a variety of modern computers featuring deep memory hierarchies can be achieved only by means of automatic performance tuning. These software packages generate and optimize ANSI C code and feed it into the target machine's general purpose C compiler to maintain portability. The scalar C code produced by performance tuning systems poses a severe challenge for vectorizing compilers. The particular code structure hampers automatic vectorization and thus inhibits satisfactory performance on processors featuring short vector extensions. This paper describes special purpose compiler technology that supports automatic performance tuning on machines with vector instructions. The work described includes (i) symbolic vectorization of DSP transforms, (ii) straight-line code vectorization for numerical kernels, and (iii) compiler backends for straight-line code with vector instructions. Methods from all three areas were combined with FFTW, SPIRAL, and ATLAS to optimize both for memory hierarchy and vector instructions. Experiments show that the presented methods lead to substantial speed-ups (up to 1.8 for two-way and 3.3 for four-way vector extensions) over the best scalar C codes generated by the original systems as well as roughly matching the performance of hand-tuned vendor libraries.

“…The currently largest prototype of IBM's supercomputer line BlueGene/L [1] is DD1, a machine equipped with 8192 custom-made IBM PowerPC 440 FP2 processors (4096 two-way SMP chips), which achieves a Linpack performance of R_max = 11.68 Tflop/s, i.e., 71% of its theoretical peak performance of R_peak = 16.38 Tflop/s. This performance ranks the prototype machine on position 4 of the Top 500 list (in June 2004).…”

Section: The BlueGene/L Supercomputer (mentioning)

confidence: 99%

“…IBM's BlueGene/L supercomputers [1] are a new class of massively parallel systems that focus not only on performance but also on lower power consumption, smaller footprint, and lower cost compared to current supercomputer systems. Although several computing centers plan to install smaller versions of BlueGene/L, the most impressive system will be the originally proposed system at Lawrence Livermore National Laboratory (LLNL), planned to be in operation in 2005.…”

Section: Introduction (mentioning)

confidence: 99%

Automatically Tuned FFTs for BlueGene/L’s Double FPU

Franchetti, Král, Lorenz, et al. 2005

High Performance Computing for Computational Science - VECPAR 2004

Abstract. IBM is currently developing the new line of BlueGene/L supercomputers. The top-of-the-line installation is planned to be a 65,536-processor system featuring a peak performance of 360 Tflop/s. This system is supposed to lead the Top 500 list when it is installed in 2005 at the Lawrence Livermore National Laboratory. This paper presents one of the first numerical kernels run on a prototype BlueGene/L machine. We tuned our formal vectorization approach as well as the Vienna MAP vectorizer to support BlueGene/L's custom two-way short vector SIMD "double" floating-point unit and connected the resulting methods to the automatic performance tuning systems Spiral and Fftw. Our approach produces automatically tuned high-performance FFT kernels for BlueGene/L that are up to 45% faster than the best scalar Spiral generated code and up to 75% faster than Fftw when run on a single BlueGene/L processor.

“…It uses a hierarchical system software architecture to achieve unprecedented levels of scalability [10]. In this section we describe some of these features we exploit to achieve highly scalable parallel I/O and create an attractive platform for data-intensive computation.…”

Section: Blue Gene/L: A Parallel I/O Perspective (mentioning)

confidence: 99%

“…Each BG/L node has a personality structure which keeps its run-time configuration data [10]. We utilize the pset configuration information.…”

Section: The Pset Organization (mentioning)

confidence: 99%