The overlay architecture raises the abstraction level of hardware design and improves the portability of hardware-accelerated applications. In the FPGA community, there is growing interest in overlay structures, typified by many-core architectures. The approach works in theory, but in practice it is beset by serious design issues. For example, FPGAs are larger than ever, which exacerbates the place-and-route problem. Moreover, owing to advances in packaging technology such as silicon interposers, a single FPGA is in fact a combination of several small-to-medium FPGAs. Tightly coupled many-core designs therefore face a hidden issue: the wires between these regions are extremely limited. This article proposes efficient essential processing elements, a micro-architecture design, and an interconnect architecture for a scalable many-core overlay. In particular, our work proposes a novel compact buffering technique that reduces memory resource utilization in tightly connected overlays while preserving computational efficiency. This technique reduces BRAM utilization to nearly 50% while achieving a best-case computational efficiency of 91.93% in a 3D Jacobi benchmark. In addition, the proposed enhancements yield around 2x and 3x improvements in performance and power efficiency, respectively. Moreover, the improved scalability allows compute resources to be increased, delivering around 4x better performance and power efficiency than the baseline DRAGON overlay.
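For readers unfamiliar with the benchmark named above, the following is a minimal software sketch of one 3D Jacobi sweep. It is an illustration under stated assumptions, not the overlay implementation: the stencil shape (the mean of the six face neighbours), the fixed boundary condition, and the helper names `jacobi3d_step` and `make_grid` are all hypothetical choices, since the abstract does not specify them.

```python
def make_grid(n, boundary=1.0, interior=0.0):
    """Build an n x n x n grid with one value on the faces and another inside."""
    def on_edge(i, j, k):
        return 0 in (i, j, k) or (n - 1) in (i, j, k)

    return [[[boundary if on_edge(i, j, k) else interior
              for k in range(n)]
             for j in range(n)]
            for i in range(n)]


def jacobi3d_step(grid):
    """One Jacobi sweep: each interior cell becomes the mean of its six face neighbours.

    Assumption: boundary cells are held fixed and copied through unchanged.
    """
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])
    # Jacobi writes into a separate output grid, so every update reads old values only.
    new = [[row[:] for row in plane] for plane in grid]
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            for k in range(1, nz - 1):
                new[i][j][k] = (
                    grid[i - 1][j][k] + grid[i + 1][j][k]
                    + grid[i][j - 1][k] + grid[i][j + 1][k]
                    + grid[i][j][k - 1] + grid[i][j][k + 1]
                ) / 6.0
    return new


if __name__ == "__main__":
    g = make_grid(4)
    for _ in range(10):
        g = jacobi3d_step(g)
    print(g[1][1][1])
```

Each output cell reads values from the planes at `i - 1` and `i + 1`, so a streaming hardware implementation generally has to keep neighbouring planes available on chip. That on-chip storage is the kind of memory cost a buffering technique targets.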