Unlike existing profiling tools, which attribute CPU cost (e.g., retired cycles or cache misses) at the function level [29], our instrumentation operates at a much finer granularity and can pinpoint the data incurring the cost. Through the above studies, we obtain several new findings: 1) the major network processing bottlenecks lie in the driver (>26%), data copy (up to 34%, depending on I/O size), and buffer release (>20%), rather than in the TCP/IP protocol itself; 2) in contrast to the generally accepted notion that long-latency NIC register accesses cause the driver overhead [2][3], our results show that the overhead comes from memory stalls on network buffer data structures; 3) releasing network buffers in the OS incurs memory stalls on in-kernel page data structures, which contributes to the buffer release overhead; 4) besides memory stalls on packet data, data copy, implemented as a series of load/store instructions, also spends significant time on L1 cache misses and instruction execution. Prevailing data-copy optimizations such as DCA are insufficient to address the copy issue.
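To make finding 4 concrete, the sketch below shows, in simplified user-space C, why a copy implemented as a load/store loop remains expensive even with DCA. The copy_packet routine is a hypothetical stand-in for a kernel copy path (not the kernel's actual implementation): it issues one load and one store per word, so packet cache lines that DCA injected only into the last-level cache still miss in L1 on the load side, and the loop itself retires a large number of instructions.

    /* Simplified, user-space sketch of a word-at-a-time copy loop.
     * copy_packet is a hypothetical stand-in for a kernel copy path;
     * illustrative only, not the kernel's copy_to_user. */
    #include <stddef.h>
    #include <stdint.h>

    static void copy_packet(void *dst, const void *src, size_t len)
    {
        uint64_t *d = dst;
        const uint64_t *s = src;

        /* One load and one store per 8-byte word: with DCA the packet
         * lines typically reside only in the last-level cache, so each
         * newly touched line loaded here still takes an L1 miss. */
        while (len >= sizeof(uint64_t)) {
            *d++ = *s++;
            len -= sizeof(uint64_t);
        }

        /* Copy any remaining tail bytes one at a time. */
        unsigned char *db = (unsigned char *)d;
        const unsigned char *sb = (const unsigned char *)s;
        while (len--)
            *db++ = *sb++;
    }

    int main(void)
    {
        char src[1500] = "payload";  /* an MTU-sized buffer */
        char dst[1500];
        copy_packet(dst, src, sizeof(src));
        return 0;
    }

The point of the sketch is that the cost is inherent to the load/store structure of the copy itself, which is why cache-injection techniques like DCA alone cannot eliminate it.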