High-radix single-chip routers have emerged as efficient building blocks for interconnection networks. It is commonly believed that at high radices hierarchical switch architectures are needed, because crossbars scale with the square of the router radix. This article proposes a novel micro-architecture that allows flat crossbar switches to scale to 128 ports, supporting 32 Gb/s/port while occupying 4.9 mm² and consuming 4.2 W, or supporting 64 Gb/s/port at 7.5 mm² and 7.5 W, in 45 nm CMOS. Key features include deep crossbar pipelining to cope with wire delay, a novel cross scheduler architecture to reduce wiring complexity, and catalytic custom gate placement within standard Electronic Design Automation (EDA) flows. It is also shown that, on chip, crossbar speedup and Combined Input-Output Queuing (CIOQ) outperform Hierarchical Queueing (HQ), providing top performance at orders-of-magnitude lower memory cost. Finally, a comparison with the recently developed Swizzle-Switch prototypes is presented, and the potential of high-radix crossbars for System-on-a-Chip interconnects is advocated.

Index Terms: On-chip interconnection networks, packet-switching networks, VLSI

† Manolis Katevenis is also with the