Abstract. This paper presents a parallel execution model and a manycore processor design to run C programs in parallel. The model automatically builds parallel sections of machine instructions from the run trace. It parallelizes instruction fetches, renamings, executions, and retirements. Predictor-based fetch is replaced by a fetch-decode-and-partly-execute stage able to compute most control instructions in order. Tomasulo's register renaming is extended to memory with a technique to match consumer/producer pairs. The Reorder Buffer is adapted to allow parallel retirement. The model is presented on a sum reduction example, which is also used to give a short analytical evaluation of the model's performance potential.
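To make the running example concrete, here is a minimal recursive sum reduction in plain C (a sketch of our own; the paper's exact benchmark code is not reproduced here). The two recursive calls at each level are independent of each other, so a machine that builds parallel sections from the trace can fetch, rename, execute, and retire them concurrently.

```c
/* A minimal sketch (our assumption, not the paper's exact benchmark):
 * a recursive sum reduction written as ordinary sequential C. Each
 * call site is a point where the model can open a parallel section,
 * because the two half-sums only meet at the final addition. */
#include <stdio.h>

static long sum(const long *a, int n)
{
    if (n == 1)
        return a[0];
    /* The two recursive calls have no data dependence on each other;
     * renamed registers and memory let them proceed in parallel. */
    return sum(a, n / 2) + sum(a + n / 2, n - n / 2);
}

int main(void)
{
    long a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    printf("%ld\n", sum(a, 8));   /* prints 36 */
    return 0;
}
```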
Summary. This paper presents an alternative method to parallelize programs, better suited to manycore processors than current operating-system- and API-based approaches like OpenMP and MPI. The method relies on parallelizing hardware and an adapted programming style. It frees and captures instruction-level parallelism (ILP). A manycore design is presented in which cores are multithreaded and able to fork new threads. The programming style is based on functions. The hardware creates a concurrent thread at each function call. Together, the programming style and the hardware free the ILP by eliminating the architectural dependences between a call and its continuation after return. We illustrate the method on a sum reduction, a matrix multiplication, and a sort. We measure the ILP of the parallel runs and show that it is high enough to feed thousands of cores because it increases with data size. We compare our method to pthread parallelization, showing that (1) our parallel execution is deterministic, (2) our thread management is cheap, (3) our parallelism is implicit, and (4) our method parallelizes both functions and loops. Implicit parallelism makes parallel code easy to write and read; deterministic parallel execution makes it easy to debug.
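To make the pthread comparison concrete, the sketch below shows the same reduction parallelized explicitly with pthreads (our own illustration, not the paper's measured code). Thread creation, argument marshalling, and joining are all explicit, which is exactly the coarse-grain, non-implicit style the summary argues against: the fork and the synchronization must be written by hand, and getting them wrong makes results non-deterministic.

```c
/* A pthread version of the sum reduction, for contrast (a sketch;
 * the paper's measured pthread code may differ). Compile with -pthread. */
#include <pthread.h>
#include <stdio.h>

struct range { const long *a; int n; long result; };

static void *psum(void *arg)
{
    struct range *r = arg;
    r->result = 0;
    for (int i = 0; i < r->n; i++)
        r->result += r->a[i];
    return NULL;
}

int main(void)
{
    long a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    struct range lo = {a, 4, 0}, hi = {a + 4, 4, 0};
    pthread_t t;

    pthread_create(&t, NULL, psum, &hi);    /* explicit fork */
    psum(&lo);                              /* main thread sums lower half */
    pthread_join(t, NULL);                  /* explicit synchronization */
    printf("%ld\n", lo.result + hi.result); /* prints 36 */
    return 0;
}
```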
This paper presents a new distributed computation model adapted to manycore processors. In this model, the run is spread over the available cores by fork machine instructions produced by the compiler, for example at function calls and loop iterations. This approach contrasts with the current model of computation based on caches and predictors: cache efficiency relies on data locality, and predictor efficiency relies on the reproducibility of control flow, both of which are less effective when the execution is distributed. The proposed computation model is based on a new core hardware design, whose main features are described in this paper. This new core is the building block of a manycore design. The processor automatically parallelizes an execution. It keeps the computation deterministic by constructing a totally ordered trace of the machine instructions run. References are renamed, including memory references, which fixes the communication and synchronization needs. When a value is referenced, its producer is found in the trace and the reader is synchronized with the writer. This paper shows how a consumer can be located on the same core as its producer, improving parallel locality and parallelization quality. Our deterministic and fine-grain distribution of a run on a manycore processor is compared with OS-primitive- and API-based parallelization (e.g. pthread, OpenMP, or MPI) and with compiler-based automatic parallelization of loops. The former implies (i) high OS overhead, meaning that only coarse-grain parallelization is cost-effective, and (ii) non-deterministic behaviour, meaning that appropriate synchronization to eliminate wrong results is a challenge. The latter is unable to fully parallelize general-purpose programs due to structures like functions, complex loops, and branches.
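As an illustration of the kind of loop the compiler could split with fork instructions, consider the sketch below (hypothetical on our part; the fork ISA itself is not shown in this abstract). After memory renaming, the iterations of the first loop share no producer/consumer pairs among themselves, so each iteration can be forked to a different core; the second loop is the consumer whose reads of c[] must each be matched with the corresponding producer in the trace.

```c
/* A sketch of a compiler-forkable loop (our illustration). Each
 * iteration of the first loop writes c[i] from a[i] and b[i] only,
 * so the iterations are mutually independent once memory is renamed.
 * The print loop consumes every c[i], synchronizing each read with
 * its unique producer iteration. */
#include <stdio.h>

#define N 4

int main(void)
{
    int a[N] = {1, 2, 3, 4}, b[N] = {10, 20, 30, 40}, c[N];

    for (int i = 0; i < N; i++)   /* each iteration is forkable */
        c[i] = a[i] + b[i];

    for (int i = 0; i < N; i++)   /* consumer: reads every c[i] */
        printf("%d ", c[i]);
    printf("\n");                 /* prints: 11 22 33 44 */
    return 0;
}
```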