Abstract: Three-stage non-blocking switching fabrics are the next step in scaling current crossbar switches to many hundreds or a few thousand ports. Congestion management, however, is the central open problem; without it, performance suffers heavily under real-world traffic patterns. Schedulers for bufferless crossbars perform congestion management but do not scale to high valencies or to multi-stage fabrics. Distributed scheduling, as used in buffered crossbars, is scalable but has never been scaled beyond crossbar valencies. We combine ideas from central and distributed schedulers, from request-grant protocols, and from credit-based flow control to propose a novel, practical architecture for scheduling in non-blocking buffered switching fabrics. The new architecture relies on multiple, independent, single-resource schedulers operating in a pipeline. It: (i) isolates well-behaved flows from congested ones; (ii) provides throughput in excess of 95% under unbalanced traffic, and delays that compete successfully against output queueing; (iii) provides weighted max-min fairness; (iv) operates directly on variable-size packets or multi-packet segments; (v) resequences cells or segments using very small buffers; and (vi) can be realistically implemented for a 1024×1024 reference fabric made out of 32×32 buffered crossbar switch elements. This paper carefully studies the many intricacies of the problem and the solution, discusses implementation, and provides performance simulation results.