We present a scalable design for accelerating the problem of solving a dense linear system of equations using LU Decomposition. A novel systolic array architecture that can be used as a building block in scientific applications is described and prototyped on a Xilinx Virtex 6 FPGA. This solver has a throughput of around 3.2 million linear systems per second for matrices of size N=4 and around 80 thousand linear systems per second for matrices of size N=16. In comparison with similar work, our design offers up to a 12-fold improvement in speed whilst requiring up to 50% less hardware resources. As a result, a linear system of size N=64 can be implemented on a single FPGA, whereas previous work was limited to a size of N=12 and resorted to complex multi-FPGA architectures to scale. Finally, the scalable design can be adapted to different sized problems with minimum effort.