This paper presents modulo unrolling without unrolling (modulo unrolling WU), a method for message aggregation for parallel loops in message passing programs that use affine array accesses in Chapel, a Partitioned Global Address Space (PGAS) parallel programming language. Messages incur a non-trivial runtime overhead, a significant component of which is independent of the size of the message. Therefore, aggregating messages improves performance. Our optimization for message aggregation is based on a technique known as modulo unrolling, pioneered by Barua [3], whose purpose was to ensure a statically predictable single tile number for each memory reference on tiled architectures, such as the MIT Raw Machine [18]. Modulo unrolling WU applies to data that is distributed in a cyclic or block-cyclic manner. In this paper, we adapt the aforementioned modulo unrolling technique to the difficult problem of efficiently compiling PGAS languages to message passing architectures. When applied to loops and data distributed cyclically or block-cyclically, modulo unrolling WU can decide when to aggregate messages, thereby reducing the overall message count and runtime for a particular loop. Compared to other methods, modulo unrolling WU greatly simplifies the complex problem of automatic generation of message passing code. It also results in substantial performance improvement compared to the non-optimized Chapel compiler. To implement this optimization in Chapel, we modify the leader and follower iterators in the Cyclic and Block Cyclic data distribution modules. We collected results comparing the performance of Chapel programs optimized with modulo unrolling WU against Chapel programs using the existing Chapel data distributions. Data collected on a ten-locale cluster show that on average, modulo unrolling WU used with Chapel's Cyclic distribution results in 64 percent fewer messages and a 36 percent decrease in runtime for our suite of benchmarks. Similarly, modulo unrolling WU used with Chapel's Block Cyclic distribution results in 72 percent fewer messages and a 53 percent decrease in runtime.
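
As an illustration (not drawn from the paper's benchmark suite), the following minimal Chapel sketch shows the kind of loop this optimization targets: a forall loop with affine array accesses over arrays mapped with Chapel's Cyclic distribution, where the shifted access B[i + 1] generally resides on a different locale than A[i], so an unoptimized compilation issues a fine-grained remote access per iteration. The array names A and B, the problem size n, and the loop body are assumptions chosen purely for illustration.

```chapel
use CyclicDist;

config const n = 1000;

// 1-D index sets distributed cyclically across locales
// (element i lives on locale (i - 1) % numLocales).
const D     = {1..n}   dmapped Cyclic(startIdx=1);
const Inner = {1..n-1} dmapped Cyclic(startIdx=1);

var A, B: [D] real;

// Affine accesses A[i] and B[i + 1]: under a cyclic distribution,
// B[i + 1] is owned by a different locale than A[i] whenever more
// than one locale is used, so without aggregation each iteration
// can trigger a separate small message.
forall i in Inner {
  A[i] = B[i + 1];
}
```

Because the cyclic mapping makes the owning locale of each affine access statically predictable, the remote elements needed by a given locale can be fetched in bulk rather than one element at a time, which is the effect modulo unrolling WU achieves through the modified leader and follower iterators.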