The Adapteva Epiphany many-core architecture comprises a 2D tiled mesh Network-on-Chip (NoC) of low-power RISC cores with minimal uncore functionality. It offers high computational energy efficiency for both integer and floating-point calculations, as well as parallel scalability. Yet despite these interesting architectural features, a compelling programming model has not been presented to date. This paper demonstrates an efficient parallel programming model for the Epiphany architecture based on the Message Passing Interface (MPI) standard. Using MPI exploits the similarities between the Epiphany architecture and a conventional parallel distributed cluster of serial cores. Our approach enables MPI codes to execute on the RISC array processor with little modification and to achieve high performance. We report benchmark results for the threaded MPI implementation of four algorithms (dense matrix-matrix multiplication, N-body particle interaction, a five-point 2D stencil update, and a 2D FFT) and highlight the importance of fast inter-core communication for the architecture.
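To make the programming model concrete, the following is a minimal sketch of one of the benchmarked patterns, a five-point 2D stencil update with halo exchange, written against the standard MPI C API with a simple 1D row decomposition. It is illustrative only: the grid sizes, iteration count, and initialization are arbitrary assumptions, and it does not reproduce the paper's threaded MPI implementation for Epiphany.

    /* Minimal sketch: five-point 2D stencil with a 1D row decomposition
     * and halo exchange via standard MPI. Illustrative only; the paper's
     * threaded MPI implementation for Epiphany is not reproduced here. */
    #include <mpi.h>
    #include <stdlib.h>

    #define NX 64          /* columns (assumed size) */
    #define NY_LOCAL 16    /* rows owned by each rank (assumed size) */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Local grid with one ghost row above and one below. */
        double (*u)[NX]  = malloc((NY_LOCAL + 2) * sizeof *u);
        double (*un)[NX] = malloc((NY_LOCAL + 2) * sizeof *un);
        for (int i = 0; i < NY_LOCAL + 2; i++)
            for (int j = 0; j < NX; j++)
                u[i][j] = un[i][j] = (double)rank;  /* arbitrary data */

        int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

        for (int iter = 0; iter < 100; iter++) {
            /* Exchange ghost rows with neighboring ranks. */
            MPI_Sendrecv(u[1], NX, MPI_DOUBLE, up, 0,
                         u[NY_LOCAL + 1], NX, MPI_DOUBLE, down, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(u[NY_LOCAL], NX, MPI_DOUBLE, down, 1,
                         u[0], NX, MPI_DOUBLE, up, 1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            /* Five-point (Jacobi) update on interior points. */
            for (int i = 1; i <= NY_LOCAL; i++)
                for (int j = 1; j < NX - 1; j++)
                    un[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] +
                                       u[i][j-1] + u[i][j+1]);
            double (*tmp)[NX] = u; u = un; un = tmp;
        }

        free(u); free(un);
        MPI_Finalize();
        return 0;
    }

On a mesh NoC such as Epiphany's, the cost of the two ghost-row exchanges per iteration is exactly the kind of inter-core communication whose speed the abstract identifies as critical.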
A clear trend in supercomputing over the past decade has been the integration of pervasive commodity products into the world's fastest systems. Given today's exploding popularity of mobile devices, we investigate the possibilities for high-performance mobile computing. Because parallel processing on mobile devices will be the key element in developing mobile yet computationally powerful systems, this study assesses the computational capability of a GPU on a low-power, ARM-based mobile device. We present a methodology for executing computationally intensive benchmarks on a handheld mobile GPU, including the practical aspects of working with the existing Android-based software stack and leveraging the OpenCL parallel programming model. The empirical results include the performance of an OpenCL N-body benchmark and an auto-tuning kernel parameterization strategy. The achieved computational performance of the low-power mobile Adreno GPU is compared with that of a quad-core ARM processor, an x86 Intel processor, and a discrete AMD GPU.
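For context, an all-pairs N-body force computation of the kind such benchmarks typically use can be expressed as a short OpenCL C kernel like the sketch below. This is a generic formulation, not the study's benchmark code; the kernel name, the softening parameter eps, and the packing of mass into the w component are illustrative assumptions.

    /* Illustrative all-pairs N-body kernel in OpenCL C; not the paper's
     * benchmark. Each work-item accumulates the acceleration on one body.
     * 'eps' is a softening term that avoids the r -> 0 singularity. */
    __kernel void body_force(__global const float4 *pos, /* xyz = position, w = mass */
                             __global float4 *acc,
                             const int n,
                             const float eps)
    {
        int i = get_global_id(0);
        if (i >= n) return;

        float4 pi = pos[i];
        float3 a = (float3)(0.0f);

        for (int j = 0; j < n; j++) {
            float4 pj = pos[j];
            float3 r = pj.xyz - pi.xyz;
            float d2 = dot(r, r) + eps * eps;
            float inv = rsqrt(d2);
            float inv3 = inv * inv * inv;
            a += pj.w * inv3 * r;   /* gravitational constant folded into masses */
        }
        acc[i] = (float4)(a, 0.0f);
    }

Auto-tuning a kernel of this kind typically searches over parameters such as the work-group size and the degree of loop unrolling or per-work-item blocking; the specific parameterization strategy evaluated in the study is not shown here.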
Techniques such as clinometry, stereoscopy, interferometry, and polarimetry are used for Digital Elevation Model (DEM) generation from Synthetic Aperture Radar (SAR) images. The choice of technique depends on the SAR configuration, the means used for image acquisition, and the relief type. The most popular techniques are interferometry for regions of high coherence and stereoscopy for regions such as steep forested mountain slopes. Stereo matching, which finds the disparity map or correspondence points between two images acquired from different sensor positions, is a core process in stereoscopy. Additionally, automatic stereo processing, which involves stereo matching, is an important process in other applications, including vision-based obstacle avoidance for unmanned air vehicles (UAVs), extraction of weak targets in clutter, and automatic target detection. Due to its high computational complexity, stereo matching has traditionally been, and continues to be, one of the most heavily investigated topics in computer vision. A stereo matching algorithm performs a subset of the following four steps: cost computation, cost (support) aggregation, disparity computation/optimization, and disparity refinement. Based on the method used for cost computation, algorithms are classified as feature-, phase-, or area-based; based on how they perform disparity computation/optimization, they are classified as local or global. We present a comparative performance study of two pairs, i.e., four versions, of global stereo matching codes. Each pair uses a different minimization technique: simulated annealing or graph cuts. The codes within a pair differ in the global cost function employed: absolute difference (AD) or a variation of normalized cross-correlation (NCC). Performance is compared in terms of execution time, the global minimum cost achieved, power and energy consumption, and the quality of the generated output. The results of this preliminary study provide insights into the suitability and relative merits of these algorithms and cost functions for execution on field-deployable and on-board computer systems with size, weight, and power (SWaP) constraints. The results show that in 12 out of 14 instances, the graph cut codes provided a 35-85% improvement in energy consumption over their simulated annealing counterparts and are therefore promising candidates for use in field-deployable and on-board systems.
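For reference, hedged C sketches of the two matching costs named above appear below. Window handling, normalization, and border treatment vary across implementations, so these forms are illustrative rather than the exact cost functions of the studied codes.

    #include <math.h>

    /* Absolute-difference (AD) cost for candidate disparity d at pixel
     * (x, y). Illustrative form; border handling is omitted. */
    static inline float cost_ad(const unsigned char *left,
                                const unsigned char *right,
                                int width, int x, int y, int d)
    {
        return fabsf((float)left[y * width + x] -
                     (float)right[y * width + (x - d)]);
    }

    /* Normalized cross-correlation (NCC) over a (2r+1) x (2r+1) window.
     * Higher values mean a better match, so a cost can be taken as
     * 1 - NCC. Border handling is again omitted. */
    static float ncc(const unsigned char *left, const unsigned char *right,
                     int width, int x, int y, int d, int r)
    {
        float sl = 0, sr = 0, sll = 0, srr = 0, slr = 0;
        int n = (2 * r + 1) * (2 * r + 1);
        for (int dy = -r; dy <= r; dy++)
            for (int dx = -r; dx <= r; dx++) {
                float a = left [(y + dy) * width + (x + dx)];
                float b = right[(y + dy) * width + (x + dx - d)];
                sl += a; sr += b; sll += a * a; srr += b * b; slr += a * b;
            }
        float num = slr - sl * sr / n;
        float den = sqrtf((sll - sl * sl / n) * (srr - sr * sr / n));
        return (den > 0.0f) ? num / den : 0.0f;
    }

In a global method, per-pixel costs such as these are summed, together with a smoothness term, into a single energy over the whole disparity map, which simulated annealing or graph cuts then minimizes.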
The continuing miniaturization and parallelization of computer hardware has facilitated the development of mobile and field-deployable systems that can accommodate terascale processing within once prohibitively small size and weight constraints. General-purpose Graphics Processing Units (GPUs) are prominent examples of such terascale devices. Unfortunately, the added computational capability of these devices often comes at the cost of larger demands on power, an already strained resource in these systems. This study explores power-versus-performance issues for a workload that can take advantage of GPU capability and is targeted to run in field-deployable environments: Synthetic Aperture Radar (SAR). Specifically, we focus on the Image Formation (IF) computational phase of SAR, often the most compute-intensive, and evaluate two different state-of-the-art GPU implementations of this IF method. Using real and simulated data sets, we evaluate performance tradeoffs for single- and double-precision versions of these implementations in terms of time-to-solution, image output quality, and total energy consumption. We employ fine-grained direct-measurement techniques to capture the isolated power utilization and energy consumption of the GPU device, and use general and radar-specific metrics to evaluate image output quality. We show that double-precision IF can slightly improve low-reflective areas of SAR images, but note that the added quality may not be worth the higher power and energy costs associated with higher-precision operations.
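The total energy figures reported in studies like this one come from integrating measured power over the run. A minimal sketch of that reduction is shown below, assuming timestamped power samples; the sampling interface itself is hardware-specific and omitted.

    /* Sketch: total energy (joules) from n timestamped power samples,
     * with t[] in seconds and p[] in watts, via trapezoidal integration.
     * Acquiring the samples is hardware-specific and not shown. */
    double energy_joules(const double *t, const double *p, int n)
    {
        double e = 0.0;
        for (int i = 1; i < n; i++)
            e += 0.5 * (p[i - 1] + p[i]) * (t[i] - t[i - 1]);
        return e;
    }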