The high latency of memory accesses is one of the factors that contributes most to reducing the performance of current vector supercomputers. Conflicts in the memory modules, together with collisions in the interconnection network in the case of multiprocessors, significantly increase the execution time of applications. In this work we propose a memory access method that, for both vector uniprocessors and multiprocessors, allows stream accesses to be performed with the smallest possible latency in the majority of cases. The basic idea is to arbitrate the memory access by defining the order in which the memory modules are visited. The stream elements are requested out of order. In addition, the access method also reduces the cost of the interconnection network.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
ISCA '95, Santa Margherita Ligure, Italy. © 1995 ACM 0-89791-698-0/95/0006 ...$3.50

The high latency of memory accesses is one of the main factors that reduce the performance of current vector supercomputers. In such systems, to achieve the required bandwidth, the memory is organized into a set of $M = 2^m$ independent modules that are accessed in parallel. The latency of each memory module is $T$ processor cycles. Conflicts occur between different accesses that visit the same memory module whenever these accesses are separated by fewer cycles than the module latency. Moreover, in multiprocessor systems, collisions can also occur in the interconnection network. These two facts make it difficult to perform accesses with low latency and to use the available memory bandwidth effectively.

For the case of a single vector processor with one memory port and a matched memory system ($M = T$), several storage schemes have been proposed to efficiently access streams with the most frequent strides. The basic scheme is interleaving [1], in which the module number is obtained from the $m$ lowest bits of the address; this storage scheme allows minimum-latency in-order access for streams of odd stride, but results in degraded performance for even strides. Other storage schemes, such as skewing [2] and linear transformations [3], also allow conflict-free access for one family of strides, where family $x$ is defined as the set of strides $S = \sigma \cdot 2^x$ with $\sigma$ odd [4]. However, these latter schemes have the advantage that the degradation for families that are not conflict free can be reduced by the use of buffers [5].

To increase the number of conflict-free families, proposals have been made in two directions: more modules are added to the memory system resulting in an unmatched memory system, or a block-interleaved storage sch...