The sparse matrix-vector multiply (SpMV) operation is a key computational kernel in many simulations and linear solvers. The large communication requirements associated with a reference implementation of a parallel SpMV result in poor parallel scalability. The cost of communication depends on the physical locations of the send and receive processes: messages injected into the network are more costly than messages sent between processes on the same node. In this paper, a node-aware parallel SpMV (NAPSpMV) is introduced to exploit knowledge of the system topology, specifically the node-processor layout, to reduce the costs associated with communication. The values of the input vector are redistributed to minimize both the number and the size of messages that are injected into the network during an SpMV, leading to a reduction in communication costs. A variety of computational experiments that highlight the efficiency of this approach are presented.

While the focus here is on the node-processor layout, other aspects of the topology, e.g., socket information, could be used in a similar fashion. The mapping of virtual ranks to physical processors can be easily determined on many supercomputers. The environment variable MPICH_RANK_REORDER_METHOD can be set to a predetermined ordering on Cray machines, while modern Blue Gene machines allow the user to specify the ordering among the coordinates A, B, C, D, E, and T through the variable RUNJOB_MAPPING or the runscript option --mapping.

There are a number of existing approaches for reducing the communication costs associated with sparse matrix-vector multiplication. Communication volume in particular is a limiting factor, and the ordering and parallel partition of a matrix both influence the total data volume. In response, graph partitioning techniques are used to identify more efficient layouts of the data [4,5,6,7]. ParMETIS [8] and PT-Scotch [9], for example, provide parallel matrix partitionings that often lead to improved load balance and more efficient sparse matrix operations. Communication volume is accurately modeled through the use of a hypergraph [10]. As a result, hypergraph partitioning also leads to a reduction in parallel communication requirements, albeit at a larger one-time setup cost. Topology-aware task mapping is used to map partitions accurately onto the allocated nodes of a supercomputer, reducing the overall cost associated with communication [11,12,13,14,15]. The approach introduced in this paper complements these efforts by providing an additional level of optimization in handling communication.

Topology-aware methods and aggregation of data are commonly used to reduce communication costs, particularly in collective operations [16,17,18,19]. Aggregation of data is also used in point-to-point communication through Tram, a library for streamlining messages in which data is aggregated and communicated only through neighboring processors [20]. The method presented in this paper aggregates messages at the node level and communicates all aggregated data at once, yielding little structural change from standard MPI communication while reducing communication costs.
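The node-processor layout that such node-level aggregation relies on can also be obtained portably at run time, in addition to the vendor-specific mapping flags discussed earlier. The following sketch, which assumes an MPI-3 implementation, groups the ranks of MPI_COMM_WORLD by shared-memory domain (in practice, by node) using MPI_Comm_split_type; the communicator name node_comm is illustrative and is not part of the NAPSpMV implementation described in this paper.

    /* Minimal sketch (not the paper's code): discover which MPI ranks
     * share a node by splitting MPI_COMM_WORLD into shared-memory groups. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int world_rank, world_size;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);

        /* Ranks that can share memory are, in practice, ranks on the same node. */
        MPI_Comm node_comm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, world_rank,
                            MPI_INFO_NULL, &node_comm);

        int local_rank, local_size;
        MPI_Comm_rank(node_comm, &local_rank);
        MPI_Comm_size(node_comm, &local_size);

        printf("world rank %d of %d is local rank %d of %d on its node\n",
               world_rank, world_size, local_rank, local_size);

        MPI_Comm_free(&node_comm);
        MPI_Finalize();
        return 0;
    }

The resulting local ranks and sizes describe how processes are grouped by node under whatever mapping the job launcher applied.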
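As a concrete illustration of the node-level aggregation described above, the sketch below gathers the values that all processes on one node must send to a single remote node onto a node leader, which then injects one combined message into the network rather than one message per process pair. It reuses node_comm from the previous sketch; the function name node_aggregated_send and the parameter remote_leader are hypothetical placeholders for the bookkeeping that a full SpMV communication package would maintain, and the symmetric receive-and-scatter step on the destination node is omitted.

    #include <mpi.h>
    #include <stdlib.h>

    /* Sketch of node-level aggregation: each rank contributes n_local
     * vector values destined for one remote node; the node leader
     * (local rank 0) forwards them in a single inter-node message. */
    void node_aggregated_send(const double *local_vals, int n_local,
                              int remote_leader, MPI_Comm node_comm)
    {
        int local_rank, local_size;
        MPI_Comm_rank(node_comm, &local_rank);
        MPI_Comm_size(node_comm, &local_size);

        int *counts = NULL, *displs = NULL;
        double *node_buf = NULL;
        int total = 0;

        if (local_rank == 0) {
            counts = malloc(local_size * sizeof(int));
            displs = malloc(local_size * sizeof(int));
        }

        /* Step 1: aggregate on-node data onto the node leader. */
        MPI_Gather(&n_local, 1, MPI_INT, counts, 1, MPI_INT, 0, node_comm);
        if (local_rank == 0) {
            for (int i = 0; i < local_size; i++) {
                displs[i] = total;
                total += counts[i];
            }
            node_buf = malloc(total * sizeof(double));
        }
        MPI_Gatherv(local_vals, n_local, MPI_DOUBLE,
                    node_buf, counts, displs, MPI_DOUBLE, 0, node_comm);

        /* Step 2: one message is injected into the network per destination
         * node, rather than one per pair of communicating processes. */
        if (local_rank == 0) {
            MPI_Send(node_buf, total, MPI_DOUBLE, remote_leader, 0,
                     MPI_COMM_WORLD);
            free(node_buf);
            free(counts);
            free(displs);
        }
    }

The on-node gather and the single inter-node send mirror the structure of standard MPI point-to-point communication, which is what allows the aggregation to be introduced with little structural change.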