A field-programmable gate array (FPGA) is a type of programmable hardware, where a logic designer must create a specific hardware design and then "compile" it into a bitstream that "configures" the device for a specific function at power-up. This compiling process, known as place-and-route (PAR), can take hours or even days, a duration which discourages the use of FPGAs for solving compute-oriented problems. To help mitigate this and other problems, overlays are emerging as useful design patterns in solving compute-oriented problems. An overlay consists of a set of compiler-like tools and an architecture written in a hardware design language like VHDL or Verilog. This cleanly separates the compiling problem into two phases: at the front end, high-level language compilers can quickly map a compute task into the overlay architecture, which is now serving as an intermediate layer. Unfortunately, the back-end of the process, where an overlay architecture is compiled into an FPGA device, remains a very time-consuming task. Many attempts have been made to accelerate the PAR process, ranging from using multicore processors, making quality/runtime tradeoffs, and using hard macros, with limited success. We introduce a new hard-macro methodology, called Rapid Overlay Builder, and demonstrate a run-time improvement up to 22 times compared to a regular unaccelerated flow using Xilinx ISE. In addition, compared to prior work, ROB continues to work well even with high logic utilization levels of 89%, and it consistently maintains high clock rates. By applying this methodology, we anticipate that overlays can be implemented much more quickly and with lower area and speed overheads than would otherwise be possible. This will greatly improve the usability of FPGAs, allowing them to be used as a replacement for CPUs in a greater variety of applications.