Introduction

The MPI (Message Passing Interface) Standard is widely used in parallel computing for writing distributed-memory parallel programs [1,2]. MPI has a number of features that provide both convenience and high performance. One of the important features is the concept of derived datatypes. Derived datatypes enable users to describe noncontiguous memory layouts compactly and to use this compact representation in MPI communication functions. Derived datatypes also enable an MPI implementation to optimize the transfer of noncontiguous data. For example, if the underlying communication mechanism supports noncontiguous data transfers, the MPI implementation can communicate the data directly without packing it into a contiguous buffer. On the other hand, if packing into a contiguous buffer is necessary, the MPI implementation can pack the data and send it contiguously. In practice, however, many MPI implementations perform poorly with derived datatypes, to the extent that users often resort to packing the data manually into a contiguous buffer and then calling MPI with the packed buffer (both approaches are sketched in a code example at the end of this section). Such usage clearly defeats the purpose of having derived datatypes in the MPI Standard. Since noncontiguous communication occurs commonly in many applications (for example, fast Fourier transform, array redistribution, and finite-element codes), improving the performance of derived datatypes has significant value.

The performance of derived datatypes can be improved in two ways. One way is to improve the data structures used to store derived datatypes internally in the MPI implementation, so that, in an MPI communication call, the implementation can quickly decode the information represented by the datatype. Research has already been done in this area, mainly in using data structures that allow a stack-based approach to parsing a datatype rather than making expensive recursive function calls [3,4] (an illustrative sketch of this idea also appears at the end of this section). Another area for improvement is to use optimized algorithms for packing noncontiguous data into a contiguous buffer in a way that the user could not easily do without advanced knowledge of the memory architecture. This latter area is the focus of this paper. To our knowledge, no other MPI implementations use memory-optimization techniques for packing noncontiguous data in their derived-datatype code (for example, see the results with IBM's MPI in Figure 8).

Interprocess communication can be considered as a combination of memory communication and network communication, as defined in [5]. Memory communication (or memory copying) is the transfer of data from the user's buffer to the local network buffer (or shared-memory buffer) and vice versa. Network communication is the movement of data between source
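
To make the derived-datatype trade-off described above concrete, the following sketch sends one column of a row-major array either by describing the strided layout with MPI_Type_vector and letting the MPI implementation handle the noncontiguous data, or by first packing the column manually into a contiguous buffer. The example is illustrative only; the array shape, destination rank, and message tag are placeholders and are not taken from the experiments in this paper.

#include <mpi.h>
#include <stdlib.h>

#define ROWS 1024   /* illustrative sizes, not from the paper */
#define COLS 1024

/* Send one column of a row-major ROWS x COLS array by describing the
 * strided layout with a derived datatype. */
static void send_column_datatype(double a[ROWS][COLS], int col, int dest, int tag)
{
    MPI_Datatype column;
    /* ROWS blocks of 1 double, consecutive blocks COLS doubles apart */
    MPI_Type_vector(ROWS, 1, COLS, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);
    MPI_Send(&a[0][col], 1, column, dest, tag, MPI_COMM_WORLD);
    MPI_Type_free(&column);
}

/* The manual workaround mentioned above: pack the column into a
 * contiguous buffer and send that buffer instead. */
static void send_column_packed(double a[ROWS][COLS], int col, int dest, int tag)
{
    double *buf = malloc(ROWS * sizeof(double));
    for (int i = 0; i < ROWS; i++)
        buf[i] = a[i][col];
    MPI_Send(buf, ROWS, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);
    free(buf);
}

Which variant is faster depends on how efficiently the implementation transfers the noncontiguous layout; closing that gap is what motivates this paper.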
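The stack-based parsing idea of [3,4] can also be illustrated with a small sketch. The node layout and functions below are hypothetical and do not reproduce the data structures of any particular MPI implementation; they only contrast a recursive traversal of a nested datatype with one driven by an explicit stack.

#include <string.h>

/* Hypothetical node in a datatype tree: "count" blocks spaced "stride"
 * bytes apart; each block is either "blocklen" contiguous bytes
 * (child == NULL) or a nested datatype. */
typedef struct dtnode {
    int count;
    size_t blocklen;
    size_t stride;
    struct dtnode *child;
} dtnode;

/* Recursive packing: one function call per block per nesting level,
 * which is the overhead the cited work avoids. */
static char *pack_recursive(const dtnode *t, const char *src, char *dst)
{
    for (int i = 0; i < t->count; i++) {
        const char *blk = src + (size_t)i * t->stride;
        if (t->child) {
            dst = pack_recursive(t->child, blk, dst);
        } else {
            memcpy(dst, blk, t->blocklen);
            dst += t->blocklen;
        }
    }
    return dst;
}

/* Stack-based packing: an explicit stack of (node, base address, block
 * index) frames replaces the recursion, so parsing runs in one loop. */
typedef struct frame { const dtnode *t; const char *base; int i; } frame;

static char *pack_with_stack(const dtnode *root, const char *src, char *dst)
{
    frame stk[64];                   /* depth bound chosen arbitrarily here */
    int top = 0;
    stk[0].t = root; stk[0].base = src; stk[0].i = 0;

    while (top >= 0) {
        frame *f = &stk[top];
        if (f->i == f->t->count) {   /* finished this node's blocks */
            top--;
            continue;
        }
        const char *blk = f->base + (size_t)f->i * f->t->stride;
        f->i++;
        if (f->t->child) {           /* descend into the nested type */
            top++;
            stk[top].t = f->t->child;
            stk[top].base = blk;
            stk[top].i = 0;
        } else {                     /* leaf: copy contiguous bytes */
            memcpy(dst, blk, f->t->blocklen);
            dst += f->t->blocklen;
        }
    }
    return dst;
}

The explicit stack removes the per-block function-call overhead; the memory-optimized packing algorithms studied in this paper address the complementary question of how the copies themselves are performed.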