Proceedings of the 50th Annual Design Automation Conference 2013
DOI: 10.1145/2463209.2488836

Simultaneous multithreading support in embedded distributed memory MPSoCs

Abstract: Scalability and programmability are important issues in large homogeneous MPSoCs. Such architectures often rely on explicit message passing among processors, each of which possesses a local private memory. This paper presents a low-overhead hardware/software distributed shared memory approach that makes such architectures multithreading-capable. The proposed solution is implemented in an open-source message-passing MPSoC by developing a POSIX-like thread API, which shows excellent scalability using app…
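The abstract refers to a POSIX-like thread API layered over the hardware/software DSM. As a minimal sketch only (the paper's actual API names are not reproduced in this excerpt, so standard pthreads names are assumed), a program written against such an interface would look like ordinary shared-memory threading code, with shared data and synchronization backed by remote memory accesses:

/* Minimal sketch assuming standard pthreads names; the paper's API is
 * described as POSIX-like, but its exact calls are not shown here.
 * On the proposed platform the shared counter and mutex would be backed
 * by the HW/SW distributed shared memory rather than one local RAM. */
#include <pthread.h>
#include <stdio.h>

static int shared_counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);    /* would map to remote-memory-access operations */
    shared_counter++;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);  /* threads may be placed on different cores */
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("counter = %d\n", shared_counter);
    return 0;
}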

Cited by 14 publications (12 citation statements). References 7 publications.
“…To incorporate DSM capabilities in the chosen platform [17], we apply a number of modifications summarized in Figure 2. The concerned parts are highlighted: microkernel, software stack allocation and RMA module.…”
Section: Modifications of Application Runtime and Hardware
confidence: 99%
“…We consider the open-source and customizable NoC-based MPSoC platform [17] implemented at RTL level. A very interesting feature of this customizable multicore platform is its ability to enable the creation of clusters according to the CSM design (see left-hand side of Figure 1).…”
confidence: 99%
“…The expanded threads on this core are put in an array ThdArr (line 2). For every pair of threads in this array, the collapsed-mode execution time is calculated using [10] and the cross-thread energy ratio using Equation 2. If the energy ratio is greater than the maximum ratio computed thus far, the maximum value is updated.…”
Section: Energy-aware Thread Collapsing
confidence: 99%
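The pairwise scan described in this excerpt can be sketched as below; collapsed_exec_time() and energy_ratio() are hypothetical placeholders for the execution-time model of [10] and Equation 2, which the excerpt does not reproduce:

/* Hedged sketch of the pairwise scan over ThdArr described above.
 * The two helper functions are illustrative stand-ins only, not the
 * cited models from [10] and Equation 2. */
#include <stddef.h>

typedef struct { double exec_time; double energy; } Thread;

/* Placeholder for the collapsed-mode execution-time model of [10]. */
static double collapsed_exec_time(const Thread *a, const Thread *b)
{
    return a->exec_time > b->exec_time ? a->exec_time : b->exec_time;
}

/* Placeholder for the cross-thread energy ratio of Equation 2. */
static double energy_ratio(const Thread *a, const Thread *b, double t_collapsed)
{
    return (a->energy + b->energy) / (t_collapsed + 1e-9);
}

/* Scan every thread pair and remember the pair with the highest ratio. */
static void best_pair(const Thread *ThdArr, size_t n,
                      size_t *best_i, size_t *best_j, double *max_ratio)
{
    *max_ratio = 0.0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            double t = collapsed_exec_time(&ThdArr[i], &ThdArr[j]);
            double r = energy_ratio(&ThdArr[i], &ThdArr[j], t);
            if (r > *max_ratio) {   /* update the maximum ratio seen so far */
                *max_ratio = r;
                *best_i = i;
                *best_j = j;
            }
        }
    }
}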
“…The operating point (v_l, f_l) which results in the least positive slack is selected (line 8). The overall thread-centric energy improvement is determined (lines 10-12). If this is < 1 (implying slowdown has a lower energy consumption than race-to-idle), (v_l, f_l) is selected as the frequency of the core; else (v_Nl, f_Nl) is selected.…”
Section: E. Energy Optimization: Slowdown vs. Race-to-idle
confidence: 99%
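The selection rule quoted here can be sketched as follows; the energy model is an assumed placeholder (E ≈ C·v²·f·t), and only the final comparison against 1 follows the quoted logic:

/* Hedged sketch of choosing between the slowdown point (v_l, f_l) and
 * the nominal race-to-idle point (v_N, f_N). The energy() model below
 * is an assumption for illustration, not the paper's model. */
#include <stdio.h>

typedef struct { double v; double f; } OpPoint;

/* Assumed dynamic-energy estimate: E = C_eff * v^2 * f * active_time. */
static double energy(OpPoint p, double active_time)
{
    const double c_eff = 1.0;    /* assumed effective switching capacitance */
    return c_eff * p.v * p.v * p.f * active_time;
}

/* improvement < 1 means slowing down uses less energy than race-to-idle. */
static OpPoint select_point(OpPoint slowdown, OpPoint nominal,
                            double t_slowdown, double t_nominal)
{
    double improvement = energy(slowdown, t_slowdown) / energy(nominal, t_nominal);
    return (improvement < 1.0) ? slowdown : nominal;
}

int main(void)
{
    OpPoint v_l = {0.9, 400e6};   /* slowdown point: just meets the deadline */
    OpPoint v_N = {1.1, 800e6};   /* nominal point: finish early, then idle */
    OpPoint chosen = select_point(v_l, v_N, 2.0e-3, 1.0e-3);
    printf("chosen frequency: %.0f Hz\n", chosen.f);
    return 0;
}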
“…The collective communication functions defined in the MPI library are converted into a set of point-to-point communication functions by the MPI library cell, so as to ease programming. As collective communication functions account for up to 80% of the data transmission latency, it is very important to improve the handling of these functions [3,4,5].…”
Section: Introduction
confidence: 99%
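As a generic illustration of flattening a collective into point-to-point calls (not the cited MPI-cell's actual implementation), a broadcast can be expanded into a loop of sends from the root and a single receive on every other rank:

/* Generic illustration only: a naive MPI_Bcast built from point-to-point
 * MPI_Send/MPI_Recv calls. Real implementations (and, per the excerpt,
 * the MPI library cell) typically use smarter schemes to cut latency. */
#include <mpi.h>

static void naive_bcast(void *buf, int count, MPI_Datatype type,
                        int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == root) {
        for (int dst = 0; dst < size; dst++)
            if (dst != root)
                MPI_Send(buf, count, type, dst, 0 /* tag */, comm);
    } else {
        MPI_Recv(buf, count, type, root, 0, comm, MPI_STATUS_IGNORE);
    }
}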