Nowadays, the use of hardware accelerators to boost the performance of HPC applications is a consolidated practice, and GPUs are by far the most widespread among them. More recently, some data centers have also successfully deployed FPGA-accelerated systems, especially to boost machine learning inference algorithms. Given the growing use of machine learning methods in various computational fields, and the increasing interest in reconfigurable architectures, we may expect that in the near future FPGA-based accelerators will become more common in HPC systems, and that they could also be exploited to accelerate general-purpose HPC workloads. In view of this, tools able to benchmark FPGAs in the context of HPC are necessary for code developers to estimate the performance of applications, as well as for computer architects to model that of systems at scale. To fulfill these needs, we have developed FER (FPGA Empirical Roofline), a benchmarking tool able to empirically measure the computing performance of FPGA-based accelerators, as well as the bandwidth of their on-chip and off-chip memories. FER measurements make it possible to draw Roofline plots for FPGAs, allowing for performance comparisons with other processors, such as CPUs and GPUs, and, at the same time, to estimate the performance upper bounds that applications could achieve on a target device. In this paper we describe the theoretical model on which FER relies, its implementation details, and the results measured on Xilinx Alveo accelerator cards.

134 = 536 GFLOP/s, (9)

which is approximately 20% higher than the maximum performance we measured empirically with FER and reported in Fig. 3.

Concerning the on-chip memories, such as URAMs, we can use a similar approach to estimate their maximum bandwidth. Using the conservative values suggested by Xilinx best practices, in this case a clock frequency of 300 MHz and a utilization factor of 80%, Eq. 4 gives:

$$ B_{\mathrm{uram}} = 300\,\mathrm{MHz} \times 64\,\mathrm{bit} \times 2 \times 1280 \times 0.8 = 4.91\,\mathrm{TB/s}, \tag{10} $$

where 1280 is the number of available URAMs, which are dual-port (hence the factor of 2). Each URAM block is 72 bits wide, but with ECC (Error Correction Code) enabled it offers 64-bit-wide protected data words. In this paper we always consider ECC to be enabled. The maximum bandwidth would be 6.1 TB/s with 100% utilization.

Concerning the off-chip memory bandwidth, assuming that this is not limited by the user design, the maximum value estimated by Eq. 6 for the 4 DDR4 banks results in:

does not impact the local memory performance.

C. CROSS-ARCHITECTURAL COMPARISON

Using DP-FP FMAs as the main mathematical operation, for which the floating-point accuracy is guaranteed to be compliant with the IEEE-754 standard [53], we can also use FER results to compare FPGAs with commodity processors.

In Fig. 6 we compare the Roofline plots of the U50, U250 and U280 FPGAs with that of the Intel Xeon Gold 6130 (based on the Skylake micro-architecture) measure...
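
Returning to the on-chip memory estimate of Eq. 10, the arithmetic can be checked with a short Python sketch. It only multiplies the terms as they appear in Eq. 10; the function name and its parameters are hypothetical names chosen for illustration (they are not part of FER), and TB is assumed to mean 10^12 bytes.

```python
def peak_bandwidth_bytes_per_s(clock_hz, word_bits, ports, blocks, utilization=1.0):
    """Aggregate bandwidth of `blocks` memories, each moving `word_bits`
    per port per clock cycle, scaled by a utilization factor.

    Hypothetical helper: it mirrors the terms of Eq. 10, nothing more.
    """
    bits_per_s = clock_hz * word_bits * ports * blocks * utilization
    return bits_per_s / 8  # bits -> bytes


# Values quoted in the text: 300 MHz clock, 64-bit ECC-protected words,
# dual-port blocks, 1280 URAMs.
CLOCK_HZ = 300_000_000
WORD_BITS = 64
PORTS = 2
URAM_BLOCKS = 1280

# Conservative estimate (80% utilization) and ideal one (100% utilization).
uram_conservative = peak_bandwidth_bytes_per_s(CLOCK_HZ, WORD_BITS, PORTS, URAM_BLOCKS, 0.8)
uram_ideal = peak_bandwidth_bytes_per_s(CLOCK_HZ, WORD_BITS, PORTS, URAM_BLOCKS, 1.0)

print(f"URAM bandwidth at  80% utilization: {uram_conservative / 1e12:.1f} TB/s")
print(f"URAM bandwidth at 100% utilization: {uram_ideal / 1e12:.1f} TB/s")
```

Under these assumptions the two values come out at roughly 4.9 TB/s and 6.1 TB/s, in line with the figures quoted for Eq. 10. Changing the word width to 72 bits would model the case with ECC disabled, which the paper does not consider.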