There has been much effort in offloading MPI collective operations into hardware. But while NIC-based collective acceleration is well studied, work on offloading collective processing into the switching fabric has been much more limited, despite its numerous advantages. A major problem with fixed-logic implementations is that either only a fraction of the possible collective operations is accelerated, or logic is wasted in applications that do not need a particular capability. Using reconfigurable logic has numerous advantages: exactly the required operations can be implemented; the desired level of performance can be specified; and new, possibly complex, operations can be defined and implemented. We have designed an in-switch collective accelerator, MPI-FPGA, and demonstrated its use with seven MPI collectives over a set of benchmarks and proxy applications (MiniApps). The accelerator uses a novel two-level switch design containing fully pipelined, vectorized aggregation logic units. Essential to this work is support for sub-communicator collectives that accommodates communicators of arbitrary shape and scales to large systems. A streaming interface improves performance for long messages. While this reconfigurable design is generally applicable, we prototype it with an FPGA-centric cluster. A sample MPI-FPGA design in a direct network achieves considerable speedups over conventional clusters in the most likely scenarios. We also present results for indirect networks with reconfigurable high-radix switches and show that this approach is competitive with SHArP technology for the subset of operations that SHArP supports. MPI-FPGA is fully integrated into MPICH and is transparent to MPI applications.
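To make the idea of in-switch aggregation for a sub-communicator concrete, the following is a minimal software sketch, not the MPI-FPGA hardware design. It assumes a hypothetical topology in which each rank attaches to one leaf switch and a single root switch combines the leaf results; the function name `in_switch_allreduce` and the `switch_of` and `contrib` mappings are illustrative inventions, not part of MPICH or MPI-FPGA. Only ranks that belong to the sub-communicator contribute, and each receives the same element-wise reduced vector.

```python
from operator import add


def in_switch_allreduce(switch_of, contrib, op=add):
    """Toy model of an allreduce performed inside switches.

    switch_of: dict mapping rank -> leaf switch id (hypothetical topology)
    contrib:   dict mapping rank -> equal-length list of numbers, given only
               for ranks that are members of the sub-communicator
    op:        binary, associative reduction operator applied element-wise
    """
    # Stage 1: each leaf switch combines the vectors of its member ranks.
    partial = {}
    for rank, vec in contrib.items():
        sw = switch_of[rank]
        if sw in partial:
            partial[sw] = [op(a, b) for a, b in zip(partial[sw], vec)]
        else:
            partial[sw] = list(vec)

    # Stage 2: a root switch combines the per-switch partial results.
    total = None
    for sw in sorted(partial):
        vec = partial[sw]
        total = vec if total is None else [op(a, b) for a, b in zip(total, vec)]

    # Broadcast: every member rank receives a copy of the reduced vector.
    return {rank: list(total) for rank in contrib}


# Ranks 0, 2, and 3 form the sub-communicator; rank 1 does not participate.
switch_of = {0: "s0", 1: "s0", 2: "s1", 3: "s1"}
contrib = {0: [1, 2], 2: [10, 20], 3: [100, 200]}
result = in_switch_allreduce(switch_of, contrib)
print(result)  # {0: [111, 222], 2: [111, 222], 3: [111, 222]}
```

In this model the end hosts send one vector and receive one vector regardless of communicator size, which is the property that makes in-network reduction attractive; the sketch says nothing about pipelining, streaming, or routing, which the actual design addresses in hardware.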