2013 International Conference on Field-Programmable Technology (FPT)
DOI: 10.1109/fpt.2013.6718331

Virtual-to-Physical address translation for an FPGA-based interconnect with host and GPU remote DMA capabilities

Abstract: We developed a custom FPGA-based Network Interface Controller named APEnet+ aimed at GPU-accelerated clusters for High Performance Computing. The card exploits the peer-to-peer capabilities (GPUDirect RDMA) of the latest NVIDIA GPGPU devices and the RDMA paradigm to perform fast direct communication between computing nodes, offloading network task execution from the host CPU. In this work we focus on the implementation of a virtual-to-physical address translation mechanism using the FPGA embedded soft processor. A…
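
To make the translation step described in the abstract more concrete, the C sketch below models a TLB lookup with a slow path for misses. It is a minimal illustration under assumed parameters (4 KiB pages, a 64-entry table); the names v2p_translate and slow_path_lookup are invented for the example and do not come from the APEnet+ firmware.

    #include <stdint.h>
    #include <stddef.h>

    #define PAGE_SHIFT  12                  /* assuming 4 KiB pages */
    #define PAGE_MASK   ((1ULL << PAGE_SHIFT) - 1)
    #define TLB_ENTRIES 64                  /* illustrative TLB size */

    /* One cached translation: virtual page number -> physical frame number. */
    struct tlb_entry {
        uint64_t vpn;
        uint64_t pfn;
        int      valid;
    };

    static struct tlb_entry tlb[TLB_ENTRIES];

    /* Placeholder for the slow path: in a design like the one described, this
     * would be firmware on the embedded soft processor searching the list of
     * registered (pinned) host or GPU buffers. Identity mapping here. */
    static uint64_t slow_path_lookup(uint64_t vpn)
    {
        return vpn;
    }

    /* Translate a virtual address to a physical address. */
    static uint64_t v2p_translate(uint64_t vaddr)
    {
        uint64_t vpn    = vaddr >> PAGE_SHIFT;
        uint64_t offset = vaddr & PAGE_MASK;

        /* Fully associative search; this is the step a hardware CAM performs
         * in a single cycle. */
        for (size_t i = 0; i < TLB_ENTRIES; i++) {
            if (tlb[i].valid && tlb[i].vpn == vpn)
                return (tlb[i].pfn << PAGE_SHIFT) | offset;
        }

        /* Miss: take the slow path and cache the result. */
        uint64_t pfn = slow_path_lookup(vpn);
        struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];
        e->vpn   = vpn;
        e->pfn   = pfn;
        e->valid = 1;
        return (pfn << PAGE_SHIFT) | offset;
    }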

Cited by 16 publications (13 citation statements)
References 8 publications
“…In [21] a number of CAM designs suitable for FPGA implementation are discussed. Our approach is a RAM-based design (in terms of resources used) but with a significant difference in the interpretation of the RAM ports semantics, as explained in [6].…”
Section: Hardware TLB Implementation (mentioning)
confidence: 99%
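
As a rough illustration of the RAM-based CAM idea mentioned in the quote above, the C model below emulates one common organization in which the RAM is addressed by the search key and stores one-hot match vectors, so a lookup is a single read instead of a parallel compare. All sizes and names (match_ram, cam_write, cam_lookup) are assumptions made for the sketch; it is not claimed to reproduce the cited design's port semantics.

    #include <stdint.h>

    #define KEY_BITS  8                        /* illustrative key width   */
    #define CAM_DEPTH 32                       /* illustrative entry count */

    static uint32_t match_ram[1u << KEY_BITS]; /* one-hot match vectors */

    /* Program entry 'idx' with 'key'; if the entry previously held a key,
     * clear its old match bit first. */
    static void cam_write(unsigned idx, uint8_t key, uint8_t old_key, int had_key)
    {
        if (had_key)
            match_ram[old_key] &= ~(1u << idx);
        match_ram[key] |= (1u << idx);
    }

    /* Search: one RAM read returns the set of entries whose content equals
     * 'key', which is what a hardware CAM delivers in a single cycle. */
    static uint32_t cam_lookup(uint8_t key)
    {
        return match_ram[key];
    }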
“…Just like on x86_64 an associative cache called Translation Lookaside Buffer (TLB) assists the MMU in its duties, we followed suit and implemented a TLB for APEnet+ by means of a Content Addressable Memory (CAM) [6]. The CAM implementation allows the lowest possible latency for the TLB, which achieves actual saturation over the links.…”
Section: Introduction (mentioning)
confidence: 99%
“…In the HPC domain, SVM for letting FPGA accelerators themselves orchestrate data transfers from and to main memory has proven beneficial both for performance and programmability [3] [23]. While academic works focus on SVM solutions consisting of TLBs managed in software either running on the host [23] or on a dedicated soft processor core [24], the industry's approach is that of full-blown hardware for maximum performance [3] [4] [5]. To optimize off-chip bandwidth utilization between host and FPGA, such systems employ FPGA-side data caches, as well as transaction coalescing and reordering [25], which further increases design complexity and leads to considerable resource utilization.…”
Section: Related Work (mentioning)
confidence: 99%
“…The hardware follows the RDMA paradigm to manage the I/O towards both CPU and GPU (for the latter, the protocol is more precisely GPUDirect RDMA [5]), in this way avoiding the use of bounce buffers and minimizing the latency jitter of data transfers to and from application memory. The virtual-to-physical address translation is entrusted to a proprietary Translation Look-aside Buffer based on Content Addressable Memory [6]. In order to sustain the ∼ 320 MB/s aggregate bandwidth of the multi-channel system, incoming data are sent to the computing node by PCIe DMA write processes.…”
Section: NaNet (mentioning)
confidence: 99%
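
To sketch the receive path described in the quote above, the fragment below shows how translated addresses might be turned into page-sized PCIe DMA write descriptors. The descriptor layout, the 4 KiB page assumption, and the function names are illustrative only and are not taken from NaNet's actual firmware.

    #include <stdint.h>
    #include <stddef.h>

    #define PAGE_SIZE 4096u   /* assumed page size for this sketch */

    /* Illustrative PCIe DMA write descriptor: physical target plus length. */
    struct dma_desc {
        uint64_t phys_addr;
        uint32_t len;
    };

    /* Translation callback, e.g. the TLB/CAM lookup sketched earlier. */
    typedef uint64_t (*translate_fn)(uint64_t vaddr);

    /* Split one received buffer into per-page DMA writes: contiguity in the
     * virtual address space says nothing about physical contiguity across
     * page boundaries, so each page gets its own descriptor. Returns the
     * number of descriptors produced. */
    static size_t build_dma_writes(uint64_t vaddr, uint32_t len,
                                   struct dma_desc *out, size_t max,
                                   translate_fn v2p)
    {
        size_t n = 0;
        while (len > 0 && n < max) {
            uint32_t in_page = PAGE_SIZE - (uint32_t)(vaddr & (PAGE_SIZE - 1));
            uint32_t chunk   = len < in_page ? len : in_page;
            out[n].phys_addr = v2p(vaddr);
            out[n].len       = chunk;
            vaddr += chunk;
            len   -= chunk;
            n++;
        }
        return n;
    }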