CPU-based many-core processors present an alternative to multicore CPU and GPU processors. In particular, the 93-petaflops Sunway supercomputer, built from clustered many-core processors, has opened a new era for high-performance computing that does not rely on GPU acceleration. However, memory bandwidth remains the main challenge for these architectures. This motivates our effort to optimize one of the most data-intensive kinds of stencil computation, namely three-dimensional applications of the lattice Boltzmann method (LBM). We propose optimizations for many-core processors that use local memory and asynchronous software prefetching, taking a representative 3D LBM solver as an example. We achieve a 33% performance gain on the Kalray MPPA-256 many-core processor by actively streaming data from/to local memory, compared to the "passive" OpenCL programming model. 1. h = 2 with the D3Q19 stencil.
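The idea of actively streaming data through local memory with asynchronous prefetching can be illustrated with a small double-buffering sketch. This is not the paper's LBM code: it uses a 1D 3-point averaging stencil instead of D3Q19, a Python thread as a stand-in for an asynchronous DMA engine, and hypothetical names (`fetch_tile`, `stencil_streamed`, `HALO`) chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

HALO = 1  # assumption: one neighbour on each side (3-point stencil)

def stencil_direct(src):
    """Reference: 3-point average with clamped boundaries, one pass over 'global' memory."""
    n = len(src)
    return [(src[max(i - 1, 0)] + src[i] + src[min(i + 1, n - 1)]) / 3.0
            for i in range(n)]

def fetch_tile(src, start, size):
    """Stand-in for an asynchronous 'get': copy a tile plus its halo into a local buffer."""
    last = len(src) - 1
    return [src[min(max(i, 0), last)]
            for i in range(start - HALO, start + size + HALO)]

def stencil_streamed(src, tile=4):
    """Same stencil, but computed tile by tile from prefetched local buffers."""
    n = len(src)
    out = [0.0] * n
    starts = list(range(0, n, tile))
    if not starts:
        return out
    with ThreadPoolExecutor(max_workers=1) as dma:
        # Prefetch the first tile before entering the loop.
        pending = dma.submit(fetch_tile, src, starts[0], min(tile, n))
        for k, start in enumerate(starts):
            size = min(tile, n - start)
            local = pending.result()  # wait until the current tile has arrived
            if k + 1 < len(starts):
                # Request the next tile now so the copy overlaps with the compute below.
                nxt = starts[k + 1]
                pending = dma.submit(fetch_tile, src, nxt, min(tile, n - nxt))
            for j in range(size):
                # local[j + HALO] is the centre cell of output index start + j.
                out[start + j] = (local[j] + local[j + 1] + local[j + 2]) / 3.0
    return out

data = [float(i * i) for i in range(10)]
assert stencil_streamed(data, tile=4) == stencil_direct(data)
```

The point of the structure is that the request for tile k+1 is issued before the compute on tile k starts, so transfer latency can be hidden behind computation. On real hardware the two local buffers would live in on-chip memory and the copy would be a DMA transfer rather than a Python thread.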
Programming Multiprocessor Systems-on-Chips (MPSoCs) with hundreds of heterogeneous Processing Elements (PEs), complex memory architectures, and Networks-on-Chips (NoCs) remains a challenge for embedded system designers. Dataflow Models of Computation (MoCs) are increasingly used for developing parallel applications, as their high level of abstraction eases the automation of mapping, task scheduling, and memory allocation onto MPSoCs. This paper introduces a technique for deploying hierarchical dataflow graphs efficiently onto MPSoCs. The proposed technique exploits different granularities of dataflow parallelism to generate both NoC-based communications and nested OpenMP loops. Deploying an image processing application on a many-core MPSoC yields speedups of up to 58.7 compared to sequential execution.
The Fourier transform is the main processing step applied to data collected from the Square Kilometre Array (SKA) receivers. The requirement is to compute a Fourier transform of 2^19 real byte samples in real time while minimizing power consumption. We address this challenge by optimizing an FFT implementation for execution on the Kalray MPPA manycore processor. Although this processor delivers high floating-point performance, we use fixed-point number representations in order to reduce memory consumption and I/O bandwidth. The result is an execution time of 1.07 ms per FFT, including data transfers. This makes it possible to use only two first-generation MPPA chips per flow of data coming from the receivers, for a total power consumption of 17.4 W.
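To make the fixed-point idea concrete, here is a minimal sketch of a radix-2 FFT on integer data. It is a generic textbook construction, not the implementation described in the abstract: the Q15 format, the halving at every stage, and all function names (`to_q15`, `q15_mul`, `fft_q15`, `dft_reference`) are assumptions made for illustration.

```python
import cmath
import math

Q = 15          # assumption: Q15 format, 1.0 is represented as 2**15
ONE = 1 << Q

def to_q15(x):
    """Quantize a float in [-1, 1] to Q15, clamping +1.0 to the largest value."""
    return max(-ONE, min(ONE - 1, int(round(x * ONE))))

def q15_mul(a, b):
    """Q15 multiply with rounding to nearest."""
    return (a * b + (1 << (Q - 1))) >> Q

def fft_q15(re, im):
    """In-place radix-2 FFT on Q15 integers; the result is FFT(x) / n."""
    n = len(re)  # must be a power of two
    j = 0
    for i in range(1, n):  # bit-reversal permutation
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            re[i], re[j] = re[j], re[i]
            im[i], im[j] = im[j], im[i]
    size = 2
    while size <= n:
        half = size // 2
        for k in range(half):
            ang = -2.0 * math.pi * k / size
            wr, wi = to_q15(math.cos(ang)), to_q15(math.sin(ang))
            for start in range(0, n, size):
                a, b = start + k, start + k + half
                tr = q15_mul(re[b], wr) - q15_mul(im[b], wi)
                ti = q15_mul(re[b], wi) + q15_mul(im[b], wr)
                # Halve at each stage so magnitudes do not grow from stage to stage.
                re[a], re[b] = (re[a] + tr) >> 1, (re[a] - tr) >> 1
                im[a], im[b] = (im[a] + ti) >> 1, (im[a] - ti) >> 1
        size *= 2
    return re, im

def dft_reference(samples):
    """Floating-point DFT of samples/128, divided by n, for comparison."""
    n = len(samples)
    return [sum((s / 128.0) * cmath.exp(-2j * math.pi * k * t / n)
                for t, s in enumerate(samples)) / n for k in range(n)]

samples = [100, -50, 25, 0, -128, 127, 3, -7]   # signed byte samples
re, im = fft_q15([s << 8 for s in samples], [0] * len(samples))
ref = dft_reference(samples)
max_err = max(max(abs(re[k] / ONE - ref[k].real), abs(im[k] / ONE - ref[k].imag))
              for k in range(len(samples)))
assert max_err < 2e-3
```

Each byte sample is shifted left by 8 bits so that it occupies the Q15 range. In this sketch every value would fit in 16 bits instead of the 32 or 64 bits of a floating-point number, which is the kind of memory and bandwidth saving the abstract refers to. The price is quantization error, which the final assertion bounds for this small input.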