This paper presents a FIFO-based parallel merge sorter optimized for recent FPGAs. More specifically, we show a sorter that sorts $K$ keys with latency $K + \log_2 K - 1$ using $\log_2 K$ comparators. It uses $\frac{K}{M} + \log_2 K + \log_2 M - 1$ memory blocks, each of capacity $M$, to implement its FIFOs. It receives $K$ keys, one per clock cycle, and begins outputting the sorted sequence $K + \log_2 K - 1$ clock cycles later. Since $K$ clock cycles are necessary to input all $K$ keys, our sorter is almost optimal in terms of latency. Also, since the total FIFO capacity is only $K + M \log_2 K + M \log_2 M - M$ and at least $K$ keys must be stored in the sorter, our sorter is also almost optimal in terms of total FIFO capacity if $M$ is small. This paper also presents a topK-sorter, which outputs the top $K$ keys among $N$ input keys for arbitrarily large $N$. Our topK-sorter runs with latency $N + \log_2 K$ using $\log_2 K + 1$ comparators. It uses memory blocks of size $M$, and its total FIFO capacity is only $2K + M \log_2 K + M \log_2 M - 2M$. Quite surprisingly, the total FIFO capacity is independent of $N$. Also, since the latency must be at least $N$, our topK-sorter is almost optimal in terms of latency. Finally, we have implemented our K-sorter and topK-sorter on a Xilinx Virtex-7 FPGA using built-in Distributed RAMs and Block RAMs. The implementation results show that our K-sorter halves the memory resources used, and that both the K-sorter and the topK-sorter are practical and efficient.

14th International Symposium on Parallel and Distributed Computing 978-1-4673-7148-3/15 $31.00
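
To make the stated bounds concrete, the following is a minimal sketch that only evaluates the cost formulas quoted in the abstract for given parameters. It does not model the sorter hardware itself. The function names `k_sorter_costs` and `topk_sorter_costs` are hypothetical, and the sketch assumes that K and M are powers of two with M no larger than K, so that every logarithm and the quotient K/M are integers.

```python
def _log2_exact(n: int) -> int:
    """Return log2(n) for a power of two n; raise ValueError otherwise."""
    if n < 1 or n & (n - 1):
        raise ValueError(f"{n} is not a power of two")
    return n.bit_length() - 1


def k_sorter_costs(K: int, M: int) -> dict:
    """Evaluate the K-sorter formulas from the abstract.

    Assumption (not stated in the abstract): K and M are powers of two
    and M <= K, so K / M is an integer.
    """
    lk, lm = _log2_exact(K), _log2_exact(M)
    if M > K:
        raise ValueError("this sketch assumes M <= K")
    return {
        "latency": K + lk - 1,                    # K + log2 K - 1
        "comparators": lk,                        # log2 K
        "memory_blocks": K // M + lk + lm - 1,    # K/M + log2 K + log2 M - 1
        "fifo_capacity": K + M * lk + M * lm - M,
    }


def topk_sorter_costs(N: int, K: int, M: int) -> dict:
    """Evaluate the topK-sorter formulas from the abstract (same assumptions)."""
    lk, lm = _log2_exact(K), _log2_exact(M)
    if M > K:
        raise ValueError("this sketch assumes M <= K")
    return {
        "latency": N + lk,                        # N + log2 K
        "comparators": lk + 1,                    # log2 K + 1
        # Note that N does not appear in the capacity.
        "fifo_capacity": 2 * K + M * lk + M * lm - 2 * M,
    }


if __name__ == "__main__":
    print(k_sorter_costs(1024, 32))
    print(topk_sorter_costs(10**6, 1024, 32))
```

For K = 1024 and M = 32, the K-sorter formulas give 46 memory blocks and a total FIFO capacity of 1472 keys, which equals 46 × 32: the capacity expression is exactly the block count multiplied by M. The topK-sorter capacity for the same K and M is 2464 keys regardless of N.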