Henrique C. Freitas scite author profile

Highlights
• Programming for a manycore is challenging.
• Limited memory and NoC are among the most important constraints of manycores.
• For CPU-bound and mixed workloads, MPPA-256 achieves better performance than Xeon.
• MPPA-256 consumes up to 13× less energy than embedded and general-purpose multicores.

Abstract
Until the last decade, performance of HPC architectures has been almost exclusively quantified by their processing power. However, energy efficiency is being recently considered as important as raw performance and has become a critical aspect to the development of scalable systems. These strict energy constraints guided the development of a new class of so-called light-weight manycore processors. This study evaluates the computing and energy performance of two well-known irregular NP-hard problems (the Traveling-Salesman Problem (TSP) and K-Means clustering) and a numerical seismic wave propagation simulation kernel (Ondes3D) on multicore, NUMA, and manycore platforms. First, we concentrate on the nontrivial task of adapting these applications to a manycore, specifically the novel MPPA-256 manycore processor. Then, we analyze their performance and energy consumption on those different machines. Our results show that applications able to fully use the resources of a manycore can have better performance and may consume from 3.8× to 13× less energy when compared to low-power and general-purpose multicore processors, respectively.

A comprehensive performance evaluation of the BinLPT workload‐aware loop scheduler

Penna

Gomes

Castro

et al. 2019

Concurrency and Computation

Summary Workload‐aware loop schedulers were introduced to deliver better performance than classical loop scheduling strategies. However, they presented limitations such as inflexible built‐in workload estimators and suboptimal chunk scheduling. Targeting these challenges, we previously proposed a workload‐aware scheduling strategy called BinLPT, which relies on three features: (i) user‐supplied estimations of the workload of the loop; (ii) a greedy heuristic that adaptively partitions the iteration space into several chunks; and (iii) a scheduling scheme based on the Longest Processing Time (LPT) rule and an on‐demand technique. In this paper, we present two new contributions to the state of the art. First, we introduce a multiloop support feature to BinLPT, which enables the reuse of estimations across loops. Based on this feature, we integrated BinLPT into a real‐world elastodynamics application, and we evaluated it running on a supercomputer. Second, we present an evaluation of BinLPT using simulations as well as synthetic and application kernels. We carried out this analysis on a large‐scale NUMA machine under a variety of workloads. Our results revealed that BinLPT is better at balancing the workloads of the loop iterations and that this behavior improves as the algorithmic complexity of the loop increases. Overall, BinLPT delivers up to 37.15% and 9.11% better performance than well‐known loop scheduling strategies, for the application kernels and the elastodynamics simulation, respectively.
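The three features listed in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `partition_chunks` and `lpt_assign`, the "close a chunk once it reaches the average load" rule, and the sample workloads are all assumptions made for illustration, and the on-demand part of the scheme is left out. Workload estimates are assumed to be non-negative with a positive total.

```python
import heapq


def partition_chunks(workloads, max_chunks):
    """Greedily group contiguous iterations into chunks of similar estimated load.

    Assumption: a chunk is closed as soon as its load reaches the average
    load per chunk (total / max_chunks). Returns (load, start, end) tuples,
    with `end` exclusive.
    """
    target = sum(workloads) / max_chunks
    chunks = []
    start, load = 0, 0
    for i, w in enumerate(workloads):
        load += w
        if load >= target:
            chunks.append((load, start, i + 1))
            start, load = i + 1, 0
    if start < len(workloads):
        if chunks and len(chunks) >= max_chunks:
            # Fold leftover iterations into the last chunk to respect max_chunks.
            last_load, last_start, _ = chunks.pop()
            chunks.append((last_load + load, last_start, len(workloads)))
        else:
            chunks.append((load, start, len(workloads)))
    return chunks


def lpt_assign(chunks, num_threads):
    """Longest Processing Time rule: heaviest chunk first, least-loaded thread next."""
    heap = [(0, t) for t in range(num_threads)]  # (accumulated load, thread id)
    heapq.heapify(heap)
    assignment = {t: [] for t in range(num_threads)}
    for load, start, end in sorted(chunks, reverse=True):
        thread_load, t = heapq.heappop(heap)
        assignment[t].append((start, end))
        heapq.heappush(heap, (thread_load + load, t))
    return assignment


if __name__ == "__main__":
    # Hypothetical user-supplied estimates, one per loop iteration.
    estimates = [1, 2, 3, 4, 5, 6, 7, 8]
    chunks = partition_chunks(estimates, max_chunks=4)
    print(lpt_assign(chunks, num_threads=2))
```

Sorting chunks by decreasing load before assignment is what distinguishes the LPT rule from a plain round-robin distribution: the heaviest chunks are placed first, so lighter chunks can fill in the remaining imbalance.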

On the Performance and Isolation of Asymmetric Microkernel Design for Lightweight Manycores

Penna

Souto

Lima

et al. 2019

Multikernel operating systems (OSs) were introduced to match the architectural characteristics of lightweight manycores. While several multikernel OS designs are possible, in this work we argue for one structured as asymmetric microkernel instances. We deliver an open-source implementation of an OS kernel with these characteristics, and we provide a comprehensive assessment using a representative benchmark suite. Our results show that an asymmetric microkernel design is scalable and introduces at most 0.9% of performance interference in an application's execution. Our results also unveil co-design aspects between an OS kernel and the architecture of lightweight manycores, concerning the memory system and core grouping.

CAP Bench: a benchmark suite for performance and energy evaluation of low‐power many‐core processors

Souza

Penna

Queiroz

et al. 2016

Concurrency and Computation

Summary The constant need for faster and more energy-efficient processors has been stimulating the development of new architectures, such as low-power many-core architectures. Researchers aiming to study these architectures are challenged by peculiar characteristics of some components such as Networks-on-Chip and lack of specific tools to evaluate their performance. In this context, the goal of this paper is to present a benchmark suite to evaluate state-of-the-art low-power many-core architectures such as the Kalray MPPA-256 low-power processor, which features 256 compute cores in a single chip. The benchmark was designed and used to highlight important aspects and details that need to be considered when developing parallel applications for emerging low-power many-core architectures. As a result, this paper demonstrates that the benchmark offers a diverse suite of programs with regard to parallel patterns, job types, communication intensity and task load strategies, suitable for a broad understanding of performance and energy consumption of MPPA-256 and upcoming many-core architectures.

A Low-Cost Energy-Efficient Raspberry Pi Cluster for Data Mining Algorithms

Saffran

García

Souza

et al. 2017

Data mining algorithms are essential tools to extract information from the increasing number of large datasets, also called Big Data. However, these algorithms demand huge amounts of computing power to achieve reliable results. Although conventional High Performance Computing (HPC) platforms can deliver such performance, they are commonly expensive and power-hungry. This paper presents a study of an unconventional low-cost energy-efficient HPC cluster composed of Raspberry Pi nodes. The performance, power and energy efficiency obtained from this unconventional platform are compared with those of a well-known coprocessor used in HPC (Intel Xeon Phi) for two data mining algorithms: Apriori and K-Means. The experimental results showed that the Raspberry Pi cluster can consume up to 88.35% and 85.17% less power than Intel Xeon Phi when running Apriori and K-Means, respectively, and up to 45.51% less energy when running Apriori.
