Research has demonstrated the potential of accelerators across a wide range of use cases. However, there is a growing imbalance between modern accelerator hardware and the CPUs that submit work to it. Recent studies of GPUs on real systems have shown that several servers are often needed per accelerator to generate enough load to exploit its computing power. This fact is frequently overlooked in research, even though it often determines the actual feasibility and overall efficiency of a deployment. In this paper, we conduct a detailed study of the possible configurations and overall cost efficiency of deploying an FPGA-based accelerator in a commercial search engine. First, we show that there are many possible configurations that balance the upstream system against the way the accelerator is configured. Not all of them are suitable in practice, including some that provide the highest throughput. Second, we analyse the cost of a deployment capable of sustaining the required workload of the commercial search engine. We examine on-premises and cloud deployments, with and without FPGAs, and with different FPGA board models. The results show that, while FPGAs have the potential to significantly improve overall performance, the performance imbalance between the host CPUs and the FPGAs can make such deployments economically unattractive. These findings are intended to inform the development and deployment of accelerators by showing what is needed on the CPU side to make them effective, and to provide insights into their end-to-end integration within existing systems.
CCS CONCEPTS
• Hardware → Hardware accelerators; • Computer systems organization → Cloud computing; Client-server architectures; Heterogeneous (hybrid) systems.