“…Sorting is one of the fundamental computation kernels in many big data applications, and there have been continuous efforts to design high-performance sorting accelerators on FPGAs [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15]. Most of these sorting accelerators are based on the multi-way merge tree sort algorithm [1], [2], [3], [4], [5], [6], [7], [8], [9] because of its massive data parallelism and regular memory access patterns. Before high-bandwidth memory (HBM) technology matured, these sorters were usually implemented on DRAM-based FPGAs and were bound by the off-chip memory bandwidth [7], [8].…”