Introduction

The MPI (Message Passing Interface) Standard is widely used in parallel computing for writing distributed-memory parallel programs [1,2]. MPI has a number of features that provide both convenience and high performance. One of the important features is the concept of derived datatypes. Derived datatypes enable users to describe noncontiguous memory layouts compactly and to use this compact representation in MPI communication functions. Derived datatypes also enable an MPI implementation to optimize the transfer of noncontiguous data. For example, if the underlying communication mechanism supports noncontiguous data transfers, the MPI implementation can communicate the data directly without packing it into a contiguous buffer. On the other hand, if packing into a contiguous buffer is necessary, the MPI implementation can pack the data and send it contiguously. In practice, however, many MPI implementations perform poorly with derived datatypes, to the extent that users often resort to packing the data manually into a contiguous buffer and then calling MPI with the packed buffer (both approaches are sketched in a code example at the end of this section). Such usage clearly defeats the purpose of having derived datatypes in the MPI Standard. Since noncontiguous communication occurs commonly in many applications (for example, fast Fourier transform, array redistribution, and finite-element codes), improving the performance of derived datatypes has significant value.

The performance of derived datatypes can be improved in two ways. One way is to improve the data structures used to store derived datatypes internally in the MPI implementation, so that, in an MPI communication call, the implementation can quickly decode the information represented by the datatype. Research has already been done in this area, mainly in using data structures that allow a stack-based approach to parsing a datatype rather than making expensive recursive function calls [3,4] (an illustrative sketch of this idea also appears at the end of this section). Another area for improvement is to use optimized algorithms for packing noncontiguous data into a contiguous buffer in a way that the user could not easily do without advanced knowledge of the memory architecture. This latter area is the focus of this paper. To our knowledge, no other MPI implementations use memory-optimization techniques for packing noncontiguous data in their derived-datatype code (for example, see the results with IBM's MPI in Figure 8).

Interprocess communication can be considered as a combination of memory communication and network communication, as defined in [5]. Memory communication (or memory copying) is the transfer of data from the user's buffer to the local network buffer (or shared-memory buffer) and vice versa. Network communication is the movement of data between source
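
To make the derived-datatype trade-off described above concrete, the following sketch sends one column of a row-major array either by describing the strided layout with MPI_Type_vector and letting the MPI implementation handle the noncontiguous data, or by first packing the column manually into a contiguous buffer. The example is illustrative only; the array shape, destination rank, and message tag are placeholders and are not taken from the experiments in this paper.

#include <mpi.h>
#include <stdlib.h>

#define ROWS 1024   /* illustrative sizes, not from the paper */
#define COLS 1024

/* Send one column of a row-major ROWS x COLS array by describing the
 * strided layout with a derived datatype. */
static void send_column_datatype(double a[ROWS][COLS], int col, int dest, int tag)
{
    MPI_Datatype column;
    /* ROWS blocks of 1 double, consecutive blocks COLS doubles apart */
    MPI_Type_vector(ROWS, 1, COLS, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);
    MPI_Send(&a[0][col], 1, column, dest, tag, MPI_COMM_WORLD);
    MPI_Type_free(&column);
}

/* The manual workaround mentioned above: pack the column into a
 * contiguous buffer and send that buffer instead. */
static void send_column_packed(double a[ROWS][COLS], int col, int dest, int tag)
{
    double *buf = malloc(ROWS * sizeof(double));
    for (int i = 0; i < ROWS; i++)
        buf[i] = a[i][col];
    MPI_Send(buf, ROWS, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);
    free(buf);
}

Which variant is faster depends on how efficiently the implementation transfers the noncontiguous layout; closing that gap is what motivates this paper.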
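The stack-based parsing idea of [3,4] can also be illustrated with a small sketch. The node layout and functions below are hypothetical and do not reproduce the data structures of any particular MPI implementation; they only contrast a recursive traversal of a nested datatype with one driven by an explicit stack.

#include <string.h>

/* Hypothetical node in a datatype tree: "count" blocks spaced "stride"
 * bytes apart; each block is either "blocklen" contiguous bytes
 * (child == NULL) or a nested datatype. */
typedef struct dtnode {
    int count;
    size_t blocklen;
    size_t stride;
    struct dtnode *child;
} dtnode;

/* Recursive packing: one function call per block per nesting level,
 * which is the overhead the cited work avoids. */
static char *pack_recursive(const dtnode *t, const char *src, char *dst)
{
    for (int i = 0; i < t->count; i++) {
        const char *blk = src + (size_t)i * t->stride;
        if (t->child) {
            dst = pack_recursive(t->child, blk, dst);
        } else {
            memcpy(dst, blk, t->blocklen);
            dst += t->blocklen;
        }
    }
    return dst;
}

/* Stack-based packing: an explicit stack of (node, base address, block
 * index) frames replaces the recursion, so parsing runs in one loop. */
typedef struct frame { const dtnode *t; const char *base; int i; } frame;

static char *pack_with_stack(const dtnode *root, const char *src, char *dst)
{
    frame stk[64];                   /* depth bound chosen arbitrarily here */
    int top = 0;
    stk[0].t = root; stk[0].base = src; stk[0].i = 0;

    while (top >= 0) {
        frame *f = &stk[top];
        if (f->i == f->t->count) {   /* finished this node's blocks */
            top--;
            continue;
        }
        const char *blk = f->base + (size_t)f->i * f->t->stride;
        f->i++;
        if (f->t->child) {           /* descend into the nested type */
            top++;
            stk[top].t = f->t->child;
            stk[top].base = blk;
            stk[top].i = 0;
        } else {                     /* leaf: copy contiguous bytes */
            memcpy(dst, blk, f->t->blocklen);
            dst += f->t->blocklen;
        }
    }
    return dst;
}

The explicit stack removes the per-block function-call overhead; the memory-optimized packing algorithms studied in this paper address the complementary question of how the copies themselves are performed.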