Abstract-Our work is motivated by the desire to design packet switches with large aggregate capacity and fast line rates. In this paper, we consider building a packet switch from multiple lower speed packet switches operating independently and in parallel. In particular, we consider a (perhaps obvious) parallel packet switch (PPS) architecture in which arriving traffic is demultiplexed over identical lower speed packet switches, switched to the correct output port, then recombined (multiplexed) before departing from the system. Essentially, the packet switch performs packet-by-packet load balancing, or inverse multiplexing, over multiple independent packet switches. Each lower speed packet switch operates at a fraction of the line rate . For example, each packet switch can operate at rate . It is a goal of our work that all memory buffers in the PPS run slower than the line rate. Ideally, a PPS would share the benefits of an output-queued switch, i.e., the delay of individual packets could be precisely controlled, allowing the provision of guaranteed qualities of service.In this paper, we ask the question: Is it possible for a PPS to precisely emulate the behavior of an output-queued packet switch with the same capacity and with the same number of ports? We show that it is theoretically possible for a PPS to emulate a first-come first-served (FCFS) output-queued (OQ) packet switch if each lower speed packet switch operates at a rate of approximately 2. We further show that it is theoretically possible for a PPS to emulate a wide variety of quality-of-service queueing disciplines if each lower speed packet switch operates at a rate of approximately 3 . It turns out that these results are impractical because of high communication complexity, but a practical high-performance PPS can be designed if we slightly relax our original goal and allow a small fixed-size coordination buffer running at the line rate in both the demultiplexer and the multiplexer. We determine the size of this buffer and show that it can eliminate the need for a centralized scheduling algorithm, allowing a full distributed implementation with low computational and communication complexity. Furthermore, we show that if the lower speed packet switch operates at a rate of (i.e., without speedup), the resulting PPS can emulate an FCFS-OQ switch within a delay bound.Index Terms-Clos network, inverse multiplexing, load balancing, output queueing, packet switch.