Toward a Core Design to Distribute an Execution on a Manycore Processor

Goossens, Bernard; Parello, David; Porada, Katarzyna; Rahmoune, Djallal

doi:10.1007/978-3-319-21909-7_38

Cited by 2 publications

(10 citation statements)

References 15 publications


“…To avoid complications in the trace building, the hardware in [5] computes the control instructions targets rather than predicting them. Computing is slower than predicting but computing tens of branches in parallel is more efficient than predicting tens of branches in sequence, parallelism being more cost-effective than a sequential predictor, even a perfect one.…”

Section: A Deterministic and Parallel Run Of C Code 21 Deterministic (mentioning)

confidence: 99%

“…x86 register rsp), meaning that both paths use the same stack area. The hardware in [5] also copies rbp, rdi, rsi and rbx. These copies are better than push/pop because a push in a function prologue and a pop in its epilogue create RAW dependences between the epilogue and the prologue of the next function call, serializing them.…”

Section: The Fork Machine Instruction (mentioning)

confidence: 99%

“…In the hardware presented in [5] the trace is run in the partial order of its dependences. False register dependences are removed by renaming [13].…”

Section: Memory Renaming (mentioning)

confidence: 99%

“…However, a prediction based mechanism is not suited to eliminate false memory dependences. In [5], the memory hardware renaming is based on a search along the instruction trace total order.…”

Section: Memory Renaming (mentioning)

confidence: 99%
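
The search along the total order of the trace mentioned in the statement above can be pictured with a small software sketch. This is only an illustration under stated assumptions, not the hardware mechanism of [5]: the `Access` record and the `find_producer` helper are hypothetical names, and the trace is modelled as a plain list of loads and stores. Each load is bound to the most recent earlier store to the same address, so a later store to that address does not have to wait for earlier readers or writers.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Access:
    """One memory access in the totally ordered trace (hypothetical model)."""
    kind: str  # "store" or "load"
    addr: int  # memory address referenced


def find_producer(trace: List[Access], reader: int) -> Optional[int]:
    """Walk the trace backward from the reader and return the index of the
    closest earlier store to the same address, or None if there is none."""
    addr = trace[reader].addr
    for i in range(reader - 1, -1, -1):
        if trace[i].kind == "store" and trace[i].addr == addr:
            return i
    return None


trace = [
    Access("store", 0x10),  # 0
    Access("store", 0x20),  # 1
    Access("store", 0x10),  # 2: overwrites address 0x10
    Access("load", 0x10),   # 3: producer is store 2, not store 0
    Access("load", 0x20),   # 4: producer is store 1
    Access("load", 0x30),   # 5: no producer in this trace
]

# Map each load to the trace index of its producer.
producers = {
    i: find_producer(trace, i)
    for i, access in enumerate(trace)
    if access.kind == "load"
}
```

In this model a reader only needs to synchronize with the single store found by the search; all other accesses to the same address are unrelated to it.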

“…In this paper we propose a measure of the producer to consumer distance, which is a way to quantify the parallelization quality. In [5], a parallelizing core hardware is proposed to distribute an execution on a manycore processor. As the parallelization is dynamic, the three problems just mentioned are easier to tackle.…”

Section: Introduction (mentioning)

confidence: 99%


Parallel Locality and Parallelization Quality

Goossens, Parello, Porada, et al. 2016

Proceedings of the 7th International Workshop on Programming Models and Applications for Multicores and Manycores

Self Cite

This paper presents a new distributed computation model adapted to manycore processors. In this model, the run is spread over the available cores by fork machine instructions produced by the compiler, for example at function calls and loop iterations. This approach contrasts with the current model of computation, which is based on caches and predictors. Cache efficiency relies on data locality and predictor efficiency relies on the reproducibility of the control. Data locality and control reproducibility are less effective when the execution is distributed. The proposed computation model is based on a new core hardware, whose main features are described in this paper. This new core is the building block of a manycore design. The processor automatically parallelizes an execution. It keeps the computation deterministic by constructing a totally ordered trace of the machine instructions run. References are renamed, including memory references, which fixes the communication and synchronization needs. When a datum is referenced, its producer is found in the trace and the reader is synchronized with the writer. This paper shows how a consumer can be located in the same core as its producer, improving parallel locality and parallelization quality. Our deterministic and fine-grain distribution of a run on a manycore processor is compared with parallelization based on OS primitives and APIs (e.g. pthread, OpenMP or MPI) and with automatic compiler parallelization of loops. The former implies (i) a high OS overhead, meaning that only coarse-grain parallelization is cost-effective, and (ii) non-deterministic behaviour, meaning that finding the appropriate synchronization to eliminate wrong results is a challenge. The latter is unable to fully parallelize general-purpose programs because of structures such as functions, complex loops and branches.


Computing on many cores

Goossens, Parello, Porada, et al. 2017

Concurrency and Computation

Self Cite

Summary: This paper presents an alternative method to parallelize programs, better suited to manycore processors than current operating system and API-based approaches such as OpenMP and MPI. The method relies on parallelizing hardware and an adapted programming style. It frees and captures the instruction-level parallelism (ILP). A manycore design is presented in which cores are multithreaded and able to fork new threads. The programming style is based on functions. The hardware creates a concurrent thread at each function call. The programming style and the hardware create the conditions to free the ILP by eliminating the architectural dependences between a call and its continuation after return. We illustrate the method on a sum reduction, a matrix multiplication and a sort. We measure the ILP of the parallel runs and show that it is high enough to feed thousands of cores because it increases with data size. We compare our method to pthread parallelization, showing that (1) our parallel execution is deterministic, (2) our thread management is cheap, (3) our parallelism is implicit, and (4) our method parallelizes functions and loops. Implicit parallelism makes parallel code easy to write and read. Deterministic parallel execution makes parallel code easy to debug.
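
The function-based style and the claim that parallelism grows with data size can be illustrated with a sum reduction written as a recursive function. This is a minimal sketch under stated assumptions, not the paper's code or its ILP measurement: `sum_reduce` is a hypothetical name, the program runs sequentially, and parallelism is estimated with a simple work/depth model in which the two recursive calls are treated as independent, as they would be if each call were forked as a thread.

```python
from typing import List, Tuple


def sum_reduce(values: List[int], lo: int, hi: int) -> Tuple[int, int, int]:
    """Sum values[lo:hi] recursively.

    Returns (total, work, depth), where work is the number of additions and
    depth is the longest chain of dependent additions (assumed cost model).
    """
    if hi - lo == 1:
        return values[lo], 0, 0
    mid = (lo + hi) // 2
    # The two calls share no data, so a forking machine could run them
    # concurrently; here they simply run one after the other.
    left, left_work, left_depth = sum_reduce(values, lo, mid)
    right, right_work, right_depth = sum_reduce(values, mid, hi)
    return (
        left + right,
        left_work + right_work + 1,
        max(left_depth, right_depth) + 1,
    )


data = list(range(1, 1025))  # 1024 values
total, work, depth = sum_reduce(data, 0, len(data))
parallelism = work / depth  # average additions available per dependent step
```

For 1024 values the model counts 1023 additions on a dependence chain of length 10, and doubling the input roughly doubles the work while adding only one level of depth, which is the sense in which the available parallelism increases with data size.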
