We have developed a highly efficient and simple parallel hardware design for merging two sorted lists residing in banked (or multi-ported) memory. The FPGA implementation uses half the hardware resources of the current state-of-the-art architecture, while achieving better performance and half the latency for the same degree of parallelism. The main challenges for merge operations on FPGAs have been low clock frequency, caused by the merger's feedback datapath being the critical path for timing, and the high resource utilisation of recent attempts to eliminate that feedback datapath. Our solution uses a modified version of the bitonic merge block, as found in a bitonic sorter, repurposed to perform a parallel merge on streaming data. As with the state of the art, it can be considered feedback-less, since it nests only one parallel comparison for any desired level of parallelism. This yields designs with a high operating frequency, 1.3 times higher than the previous best on our test platform. Since the new design uses half the hardware resources, it allows more parallelism and leaves room for additional logic.
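To make the idea concrete, below is a minimal software sketch of the bitonic merge principle the abstract refers to; the function name, the Python formulation, and the P = 4 example are our own illustrative assumptions, not the paper's RTL design.

```python
# Sketch: merging two ascending chunks of length P with a bitonic merge network.
# Reversing one chunk makes the concatenation bitonic; log2(2P) compare-exchange
# stages then produce a fully sorted sequence. In hardware, all compare-exchange
# pairs within a stage are independent and can run in parallel in one cycle.
def bitonic_merge(a, b):
    seq = a + b[::-1]              # ascending + descending = bitonic
    n = len(seq)                   # n = 2P, assumed to be a power of two
    half = n // 2
    while half >= 1:
        for start in range(0, n, 2 * half):
            for i in range(start, start + half):
                if seq[i] > seq[i + half]:
                    seq[i], seq[i + half] = seq[i + half], seq[i]
        half //= 2
    return seq

# Example with P = 4:
print(bitonic_merge([1, 4, 7, 9], [2, 3, 8, 10]))   # [1, 2, 3, 4, 7, 8, 9, 10]
```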
This work improves on recent research on sorting acceleration on FPGAs. An efficient design is introduced for sorting data that fit on chip, with additional functionality to merge sorted sublists recursively for inputs of arbitrary length. While many-leaf mergers are conventionally single-rate, a novel technique in our approach is to use parallel merging only for the last stages of the merge tree, enabling a bandwidth-adapted, multi-rate many-leaf merge. Our open-source RTL generator produces sorting peripherals with customisable parallelism and data format. We evaluate our FPGA design as a 128-bit-wide peripheral on an MPSoC platform, achieving a speedup of up to 49 times over the A53 core for sorting, and up to 27 times for our specialised database analytics application.
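As a software-level illustration of merging sorted sublists recursively (a pairwise merge tree), the sketch below uses our own hypothetical function names and plain Python lists; it is not the generator's multi-rate hardware merge tree, only the underlying idea.

```python
# Standard two-way merge of two ascending lists.
def merge_two(a, b):
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:]); out.extend(b[j:])
    return out

# Merge sorted sublists pairwise, level by level, until a single sorted list remains.
def merge_tree(sublists):
    while len(sublists) > 1:
        nxt = []
        for k in range(0, len(sublists), 2):
            if k + 1 < len(sublists):
                nxt.append(merge_two(sublists[k], sublists[k + 1]))
            else:
                nxt.append(sublists[k])
        sublists = nxt
    return sublists[0] if sublists else []

print(merge_tree([[3, 9], [1, 4], [2, 8], [5, 7]]))  # [1, 2, 3, 4, 5, 7, 8, 9]
```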
Database systems are key to a variety of applications, and FPGA-based accelerators have shown promise in supporting high-performance database systems. This survey presents a systematic review of research on accelerating analytical database systems using FPGAs. The review covers database acceleration frameworks and accelerator implementations for various database operators. Finally, the survey discusses promising future technologies and the challenges to be addressed by future research in this area.
We present different methods to increase the performance of Hybrid Monte Carlo simulations of the Hubbard model in two dimensions. Our simulations concentrate on a hexagonal lattice, though they can easily be generalized to other lattices. We find that the best results are achieved using a flexible GMRES solver for matrix inversions and the second-order Omelyan integrator with Hasenbusch acceleration on different time scales for the molecular dynamics. We demonstrate how an arbitrary number of Hasenbusch mass terms can be included in this geometry and find that the optimal speed depends only weakly on the number of Hasenbusch masses and their values. As such, the tuning of these masses is amenable to automation, and we present an algorithm for this tuning based on the known dependence of solver time and forces on the Hasenbusch masses. We benchmark our algorithms against systems where direct numerical diagonalization is feasible and find excellent agreement. We also simulate systems with hexagonal lattice dimensions up to 102 × 102 and N_t = 64. We find that the Hasenbusch algorithm leads to a speedup of more than an order of magnitude.
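For context, the Hasenbusch acceleration mentioned above rests on a standard determinant factorisation; the notation below is our own generic form, not taken from the paper, whose operators and mass parameters may differ:

\[
\det M(m) \;=\; \det\!\big(M(m)\,M(m_1)^{-1}\big)\,\det\!\big(M(m_1)\,M(m_2)^{-1}\big)\cdots\det M(m_n),
\qquad m < m_1 < \dots < m_n .
\]

Each ratio receives its own pseudofermion field and can be integrated on its own molecular-dynamics time scale, which reduces the force from the lightest (most expensive) mass and allows larger step sizes.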
We present an efficient, high-throughput and scalable hardware design for accelerating the merge phase of the sort-merge join operation. Sort-merge join is one of the fundamental join algorithms and among the most frequently executed operations in relational databases. It has been the focus of much recent research, each effort with different shortcomings and usually focusing only on the sort phase. In this paper, a new parallel sort-merge join architecture is developed that provides data-streaming functionality and high throughput. The key idea of the paper is a novel design for a co-grouping engine, with which the input data are summarised on the fly. In this way, the operation is performed on streams of data, preserving the linear access pattern for faster data movement and eliminating the need for a replay buffer or cache. In contrast to related work, our approach makes no assumptions about the value distribution of the input data and applies to any input size and width. We evaluate the design on a Zynq UltraScale+ based platform and show a speedup of up to 3.1 times over the host processor, even without accelerating the sort phase, while still accounting for data transfers to and from main memory in Linux.
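As a minimal software analogue of merge join with on-the-fly co-grouping of equal keys (our own illustrative sketch, not the paper's streaming engine), assuming both inputs are lists of (key, payload) pairs sorted by key:

```python
# Sketch: single pass over two sorted inputs; rows sharing a key are co-grouped
# from both sides and joined before the scan advances, so neither input is
# revisited (no replay buffer needed).
def merge_join(left, right):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        kl, kr = left[i][0], right[j][0]
        if kl < kr:
            i += 1
        elif kl > kr:
            j += 1
        else:
            li, rj = i, j
            while i < len(left) and left[i][0] == kl:
                i += 1
            while j < len(right) and right[j][0] == kl:
                j += 1
            for _, a in left[li:i]:
                for _, b in right[rj:j]:
                    out.append((kl, a, b))
    return out

print(merge_join([(1, 'a'), (2, 'b'), (2, 'c')], [(2, 'x'), (3, 'y')]))
# [(2, 'b', 'x'), (2, 'c', 'x')]
```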