Magnetic Resonance (MR)-guided online Adaptive RadioTherapy (MR-goART) utilises the excellent soft-tissue contrast of MR images acquired just before treatment to quickly update and personalise radiotherapy treatment plans. Four-dimensional (4D) MR Imaging (MRI) can resolve variations in respiratory motion patterns, and 4D MRI data can be used to adapt the radiation beams to maximally target the tumour while sparing as much healthy tissue as possible. 4D MRI reconstruction, however, is computationally challenging, and current state-of-the-art implementations are unable to meet MR-goART time requirements. This study bridges the gap between high-performance computing and medical applications by developing and implementing a parallel, heterogeneous architecture for the XD-GRASP algorithm capable of meeting the MR-goART time requirements. Our architecture exploits long-vector instructions and utilises all available resources, while minimising and hiding the communication overhead when external GPUs are used. As a result, the reconstruction time was reduced from 994 seconds to just 90 seconds, a speed-up of more than 11x. In addition, we evaluated the impact of the emerging Processing-in-Memory (PIM) technology. Our simulation results show that 16 low-power, in-order PIM cores with no SIMD unit are 2.7x faster than an Intel Core™ i7-9700 8-core CPU equipped with AVX512 SIMD units. Additionally, 40 PIM cores match the performance of two AMD EPYC 7551 CPUs with 32 cores each, and just 87 PIM cores
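To make the vectorisation claim concrete, below is a minimal C++ sketch, not the paper's actual code, of the kind of memory-bound loop an XD-GRASP-style iterative reconstruction repeats many times: a temporal finite difference over respiratory phases, parallelised across cores and vectorised with OpenMP so the compiler can lower it to long-vector (e.g., AVX-512) instructions. All names are illustrative assumptions.

    // Illustrative only: temporal finite difference across respiratory
    // phases, the temporal total-variation term XD-GRASP penalises.
    #include <complex>
    #include <cstddef>
    #include <vector>

    using cfloat = std::complex<float>;

    // diff[v] = next[v] - curr[v] for every voxel v; curr and next hold
    // the same voxel grid in two adjacent respiratory phases.
    void temporal_diff(const std::vector<cfloat>& curr,
                       const std::vector<cfloat>& next,
                       std::vector<cfloat>& diff)
    {
        const std::size_t n = diff.size();
        // Parallel across cores, SIMD within each core: with -fopenmp and
        // -march=native the inner subtraction maps to wide vector units.
        #pragma omp parallel for simd
        for (std::size_t v = 0; v < n; ++v)
            diff[v] = next[v] - curr[v];
    }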
We present results from a stand-alone simulation of electron single Coulomb scattering implemented entirely on a Field Programmable Gate Array (FPGA) architecture and compared with an identical simulation on a standard CPU. FPGA architectures offer unprecedented speed-up capability for Monte Carlo simulations, albeit with the caveats of lengthy development cycles and resource limitations, particularly in terms of on-chip memory and DSP blocks. As a proof of principle of acceleration on an FPGA, we chose the single-scattering process of electrons in water at an energy of 6 MeV. The initial codebase was implemented in C++ and optimised for CPU processing. To measure the potential performance gains of FPGAs compared to modern multi-core CPUs, we computed 100 million histories of a 6 MeV electron interacting in water. Without performing any hardware-specific optimisation, the results show that the FPGA implementation is over 110 times faster than an optimised parallel implementation running on 12 CPU cores, and over 270 times faster than a sequential single-core CPU implementation. The results on both architectures were statistically equivalent. The successful implementation and acceleration results are very encouraging for the future exploitation of more sophisticated Monte Carlo simulations on FPGAs for High Energy Physics applications.
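As an illustration of the workload being accelerated, here is a minimal, self-contained C++ Monte Carlo sketch that samples single-scattering polar angles via inverse-CDF sampling of a screened-Rutherford distribution. The screening parameter and the sampling model are assumptions made for illustration; they are not taken from the paper.

    // Illustrative sketch, not the paper's code: sample mu = cos(theta)
    // for electron single Coulomb scattering from a screened-Rutherford
    // distribution, pdf(mu) ~ 1 / (1 - mu + 2*eta)^2, via inverse CDF.
    #include <cstdio>
    #include <random>

    int main()
    {
        constexpr double eta = 1.0e-4;           // hypothetical screening parameter
        constexpr long histories = 100'000'000;  // the 100M histories quoted

        std::mt19937_64 rng(42);
        std::uniform_real_distribution<double> uni(0.0, 1.0);

        double sum_mu = 0.0;
        for (long i = 0; i < histories; ++i) {
            const double xi = uni(rng);
            // Inverse-CDF sample: xi = 0 gives mu = 1 (forward),
            // xi = 1 gives mu = -1 (full backscatter).
            const double mu = 1.0 - 2.0 * eta * xi / (1.0 + eta - xi);
            sum_mu += mu;
        }
        std::printf("mean cos(theta) = %.8f\n", sum_mu / histories);
        return 0;
    }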
Processing large-scale graphs is challenging due to the nature of the computation, which causes irregular memory access patterns. Managing such irregular accesses may cause significant performance degradation on both CPUs and GPUs. Thus, recent research trends propose graph processing acceleration with Field-Programmable Gate Arrays (FPGA). FPGAs are programmable hardware devices that can be fully customised to perform specific tasks in a highly parallel and efficient manner. However, FPGAs have a limited amount of on-chip memory that cannot fit the entire graph. Due to the limited device memory size, data needs to be repeatedly transferred to and from the FPGA on-chip memory, which makes data transfer time dominate over the computation time. A possible way to overcome the FPGA accelerators' resource limitation is to employ a multi-FPGA distributed architecture together with an efficient partitioning scheme. Such a scheme aims to increase data locality and minimise communication between different partitions. This work proposes an FPGA processing engine that overlaps, hides and customises all data transfers so that the FPGA accelerator is fully utilised. This engine is integrated into a framework for using FPGA clusters and uses an offline partitioning method to facilitate the distribution of large-scale graphs. The proposed framework uses Hadoop at a higher level to map a graph to the underlying hardware platform. The higher layer of computation is responsible for gathering the blocks of data that have been pre-processed and stored on the host's file system and distributing them to a lower layer of computation made up of FPGAs. We show how graph partitioning combined with an FPGA architecture leads to high performance, even when the graph has millions of vertices and billions of edges. For the PageRank algorithm, widely used to rank the importance of nodes in a graph, our implementation is the fastest when compared to state-of-the-art CPU and GPU solutions, achieving a speedup of 13x compared to their 8x and 3x, respectively. Moreover, on large-scale graphs the GPU solution fails due to memory limitations, while the CPU solution achieves a speedup of 12x compared to the 26x achieved by our FPGA solution. Other state-of-the-art FPGA solutions are 28x slower than our proposed solution. When the size of a graph limits the performance of a single FPGA device, our performance model shows that using multiple FPGAs in a distributed system can further improve performance by about 12x. This highlights our implementation's efficiency for large datasets that do not fit in the on-chip memory of a hardware device.
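To ground the PageRank discussion, the sketch below shows one pull-based PageRank iteration over a CSR graph in plain C++. The gather through col_idx is the irregular, data-dependent access pattern described above, which the proposed FPGA engine hides by overlapping and customising transfers. The structure and names are illustrative, not the paper's implementation, and dangling nodes (out-degree zero) are assumed to be handled elsewhere.

    // Illustrative pull-based PageRank iteration over a CSR graph.
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct CsrGraph {
        std::vector<uint64_t> row_ptr;  // size |V|+1
        std::vector<uint32_t> col_idx;  // size |E|: in-neighbours per vertex
        std::vector<uint32_t> out_deg;  // out-degree of each vertex (>= 1)
    };

    void pagerank_iter(const CsrGraph& g,
                       const std::vector<float>& rank,
                       std::vector<float>& next,
                       float damping = 0.85f)
    {
        const std::size_t n = g.row_ptr.size() - 1;
        for (std::size_t v = 0; v < n; ++v) {
            float acc = 0.0f;
            // Gather from in-neighbours: col_idx drives scattered reads
            // into rank[], the access pattern that thrashes CPU/GPU caches.
            for (uint64_t e = g.row_ptr[v]; e < g.row_ptr[v + 1]; ++e) {
                const uint32_t u = g.col_idx[e];
                acc += rank[u] / g.out_deg[u];
            }
            next[v] = (1.0f - damping) / n + damping * acc;
        }
    }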
Field Programmable Gate Arrays (FPGAs) are gaining popularity in the context of scientific computing due to the recent advances of High-Level Synthesis (HLS) toolchains for customised hardware implementations, combined with the increase in computing capabilities of modern FPGAs. As a result, developers are able to implement more complex scientific workloads, which often require the evaluation of univariate numerical functions. In this study, we propose a methodology for table-based polynomial interpolation aimed at producing area-efficient FPGA implementations of such functions that achieve the same accuracy and similar performance as direct implementations. We also provide a rigorous error analysis to guarantee the correctness of the results. Our methodology covers the forecasting of the resource utilisation of the polynomial interpolator and, based on the characteristics of the function, guides the developer to the most area-efficient FPGA implementation. Our experiments show that, in the case of a black-body radiation spectrum application based on evaluating Planck's law, it is possible to reduce resource utilisation by up to 90% when compared to direct implementations not using table-based methods. Moreover, when only the kernels are considered, our method uses up to two orders of magnitude fewer resources with no performance penalty. Building on previous, more theoretical work, our study investigates practical applications of table-based methods in the context of high-performance and scientific computing, where they are used to implement common but more complex functions than the elementary functions widely studied in the related literature.
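The core of a table-based method can be sketched in a few lines of C++: split the input interval into uniform segments, store low-degree polynomial coefficients per segment, then evaluate with one table lookup and a short Horner chain (on an FPGA, the table maps to BRAM and the Horner chain to a few DSP blocks). The quadratic 3-point fit below is a naive stand-in used only for illustration; the paper's coefficient generation, error analysis and resource forecasting are more rigorous.

    // Illustrative table-based interpolation: lookup plus Horner chain.
    #include <algorithm>
    #include <array>

    constexpr int SEGMENTS = 256;

    struct Quad { double c0, c1, c2; };   // f(x) ~ c0 + c1*t + c2*t*t

    // Hypothetical offline step: fit one quadratic per segment of f on
    // [a, b] from samples at the segment ends and midpoint.
    std::array<Quad, SEGMENTS> build_table(double (*f)(double),
                                           double a, double b)
    {
        std::array<Quad, SEGMENTS> tab{};
        const double h = (b - a) / SEGMENTS;
        for (int i = 0; i < SEGMENTS; ++i) {
            const double x0 = a + i * h;
            const double y0 = f(x0), ym = f(x0 + 0.5 * h), y1 = f(x0 + h);
            // Quadratic through (0,y0), (0.5,ym), (1,y1) in local coords.
            tab[i] = { y0, -3*y0 + 4*ym - y1, 2*y0 - 4*ym + 2*y1 };
        }
        return tab;
    }

    double interp(const std::array<Quad, SEGMENTS>& tab,
                  double a, double b, double x)
    {
        const double s = (x - a) / (b - a) * SEGMENTS;
        const int i = std::min(static_cast<int>(s), SEGMENTS - 1);
        const double t = s - i;                // local coordinate in [0, 1)
        const Quad& q = tab[i];
        return q.c0 + t * (q.c1 + t * q.c2);   // Horner evaluation
    }

For the black-body application, f would be Planck's spectral radiance at a fixed temperature over the wavelength range of interest.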