CHAPTER 9 NoC customizations for message passing interface primitives
INTRODUCTION

To sustain exponential performance scaling for parallel applications, multicore designs have become the dominant organization for high-performance microprocessors. The massive transistor budgets now available to hardware architects give rise to the expectation that the exponential growth in core counts will soon bring hundreds of cores to mainstream computers [3].

Traditional shared-bus interconnects cannot provide a multicore architecture with the scalability and bandwidth required for communication among a large number of cores, and full crossbars become impractical as the core count grows. Networks-on-chip (NoCs) have emerged as the most viable solution to this problem. NoCs are more cost-effective than buses in terms of traffic scalability, area, and power in large-scale systems, and they lend themselves to component reuse, design modularity, plug-and-play composition, and scalability while avoiding the problems of global wire delay. Recent designs, such as Tilera's 64-core TILE64 processor [51], Intel's 80-core Teraflops chip [22], Arteris's NoC interconnect IPs [2], and NXP-Philips' Æthereal NoC [18], have demonstrated the potential of NoC-based designs.

However, most existing general-purpose NoC designs do not support high-level programming models well, which compromises performance and efficiency when programs are mapped onto NoC-based hardware architectures. Most current programming-model optimizations that address these problems retain a firm abstraction of the interconnection fabric as an opaque communication medium: protocol optimizations operate on end-to-end messages between requester and responder nodes, while network optimizations separately aim at reducing communication latency and improving throughput for data messages. Reducing, or even eliminating, the gap between multicore programming models and the underlying NoC-based hardware is a demanding task. A key challenge in multicore research is therefore to provide efficient support for parallel programming models so that applications can exploit all the hardware features available in NoC-based multicore architectures.

A large body of recent research has focused on integrating the message passing interface (MPI) standard into multicore architectures. MPI is a standard, public-domain, platform-independent communication library for message-passing programming, and numerous applications have been ported to or developed for the MPI model, so optimizing its performance on multicore architectures is a necessity. These research efforts include a system-on-chip (SoC) MPI library implemented on a Xilinx Virtex field-programmable gate array (FPGA) [30], rMPI, which targets embedded systems built on MIT's Raw processor [39], and the TMD-MP...
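To make the MPI primitives discussed in the remainder of this chapter concrete, the short C sketch below shows the canonical point-to-point pattern (MPI_Send and MPI_Recv) from the standard library. It is a generic illustration only, not code taken from any of the NoC-specific implementations cited above.

/* Minimal MPI point-to-point example: rank 0 sends an integer to rank 1.
 * Generic illustration of standard MPI primitives. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;

    MPI_Init(&argc, &argv);               /* initialize the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* identify this process      */

    if (rank == 0) {
        value = 42;
        /* blocking send: buffer, count, datatype, destination, tag, communicator */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* blocking receive: matched to the send by source, tag, and communicator */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();                       /* shut down the MPI runtime  */
    return 0;
}

Such a program is typically built and launched with an MPI toolchain (e.g., mpicc and mpirun), with each rank mapped to a separate core; the chapter's concern is how efficiently such send/receive primitives can be carried by the underlying NoC.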