Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems 2006
DOI: 10.1145/1168857.1168897

Integrated network interfaces for high-bandwidth TCP/IP

Abstract: This paper proposes new network interface controller (NIC) designs that take advantage of integration with the host CPU to provide increased flexibility for operating system kernel-based performance optimization. We believe that this approach is more likely to meet the needs of current and future high-bandwidth TCP/IP networking on end hosts than the current trend of putting more complexity in the NIC, while avoiding the need to modify applications and protocols. This paper presents two such NICs. The first, t…

Cited by 34 publications (19 citation statements)
References 24 publications
“…Unlike existing profiling tools that attribute CPU cost, such as retired cycles or cache misses, at the function level [29], our instrumentation is at a very fine granularity and can pinpoint the data incurring the cost. Through the above studies, we obtain several new findings: 1) the major network processing bottlenecks lie in the driver (>26%), data copy (up to 34%, depending on I/O sizes) and buffer release (>20%), rather than the TCP/IP protocol itself; 2) in contrast to the generally accepted notion that long-latency NIC register accesses cause the driver overhead [2][3], our results show that the overhead comes from memory stalls on network buffer data structures; 3) releasing network buffers in the OS results in memory stalls on in-kernel page data structures, contributing to the buffer release overhead; 4) besides memory stalls on packets, data copy, implemented as a series of load/store instructions, also spends significant time on L1 cache misses and instruction execution. Prevailing optimizations for data copy, like DCA, are insufficient to address the copy issue.…”
Section: Introduction
confidence: 86%
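The fine-grained attribution this excerpt describes can be pictured with a small timing harness. The following is a minimal sketch, assuming x86-64 Linux and a GCC/Clang toolchain: it brackets a single MTU-sized memcpy() with serializing TSC reads, so the cycle count is attributed to that one copy of that one buffer rather than to a whole function. The payload size and buffer names are illustrative assumptions, not the cited paper's instrumentation.

/* Time one packet-sized data copy with the TSC (assumed x86-64 Linux).
 * Sketch only: buffer names and the 1500-byte payload are hypothetical. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <x86intrin.h>            /* __rdtscp */

#define PAYLOAD 1500              /* assumed MTU-sized payload */

int main(void)
{
    char *src = malloc(PAYLOAD), *dst = malloc(PAYLOAD);
    unsigned aux;

    if (!src || !dst)
        return 1;
    memset(src, 0xab, PAYLOAD);

    /* __rdtscp waits for prior instructions to retire, so the two
     * reads bracket exactly the copy between them. */
    uint64_t t0 = __rdtscp(&aux);
    memcpy(dst, src, PAYLOAD);    /* the "data copy" under study */
    uint64_t t1 = __rdtscp(&aux);

    printf("copy of %d bytes: %llu cycles (dst[0]=0x%02x)\n",
           PAYLOAD, (unsigned long long)(t1 - t0),
           (unsigned char)dst[0]);
    free(src);
    free(dst);
    return 0;
}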
“…Existing studies [2][3] claimed that NIC register accesses contribute to the driver overhead due to long-latency traversals over the PCI-E bus, and then proposed NIC integration to reduce the overhead. In this subsection, we architecturally break down the driver overhead for each packet and present the results in Figure 3.…”
Section: Driver
confidence: 99%
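An architectural breakdown of the kind mentioned here can be approximated from user space with Linux hardware performance counters. Below is a minimal sketch, assuming Linux with perf_event_open available and a permissive perf_event_paranoid setting; it counts cycles and cache misses around a stand-in loop. Real driver code executes in the kernel, so this illustrates only the counting pattern, not the cited measurement setup.

/* Count cycles and cache misses around a code region (Linux sketch). */
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static int perf_open(uint32_t type, uint64_t config)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.type = type;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;       /* measure the user-space region only */
    attr.exclude_hv = 1;
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
    int cyc = perf_open(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES);
    int mis = perf_open(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CACHE_MISSES);
    if (cyc < 0 || mis < 0) {
        perror("perf_event_open");
        return 1;
    }

    ioctl(cyc, PERF_EVENT_IOC_RESET, 0);
    ioctl(mis, PERF_EVENT_IOC_RESET, 0);
    ioctl(cyc, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(mis, PERF_EVENT_IOC_ENABLE, 0);

    volatile uint64_t sink = 0;    /* stand-in for per-packet driver work */
    for (uint64_t i = 0; i < 1000000; i++)
        sink += i;

    ioctl(cyc, PERF_EVENT_IOC_DISABLE, 0);
    ioctl(mis, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t cycles = 0, misses = 0;
    read(cyc, &cycles, sizeof(cycles));
    read(mis, &misses, sizeof(misses));
    printf("cycles=%llu cache-misses=%llu\n",
           (unsigned long long)cycles, (unsigned long long)misses);
    return 0;
}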
“…Such designs, however, do not assume large manycore chips, where the NOC latency represents a significant fraction of the end-to-end latency. An NI can also leverage coherent shared memory to optimize TCP/IP stacks [7,23,31]. While previous proposals on NI optimization focused on traditional networking and were thus inevitably engaged in expensive network protocol processing, our work focuses on specialized NIs for RDMA-like communication, in which executing remote operations only requires low-cost user-level interactions with memory-mapped queues and minimal protocol processing.…”
Section: Related Work
confidence: 99%
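The "low-cost user-level interactions with memory-mapped queues" in this excerpt amount to posting descriptors into a shared ring with plain loads and stores, no system calls. Below is a minimal C11 sketch of a single-producer/single-consumer work queue; the descriptor layout, names, and sizes are hypothetical, and a real NI would map the ring into both the application's address space and the device's, rather than use a static variable.

/* SPSC work queue sketch: enqueue/dequeue are one descriptor write plus
 * one atomic index update. Layout and names are hypothetical. */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define QSLOTS 64u                         /* power of two */

struct wq_entry { uint64_t addr, len; };   /* hypothetical descriptor */

struct wq {
    struct wq_entry slots[QSLOTS];
    _Atomic uint32_t head;                 /* consumer (NI) cursor */
    _Atomic uint32_t tail;                 /* producer (app) cursor */
};

static int wq_post(struct wq *q, struct wq_entry e)
{
    uint32_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    uint32_t h = atomic_load_explicit(&q->head, memory_order_acquire);
    if (t - h == QSLOTS)
        return -1;                         /* queue full */
    q->slots[t % QSLOTS] = e;              /* write descriptor in place */
    /* Release ordering publishes the descriptor before the new tail. */
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    return 0;
}

static int wq_poll(struct wq *q, struct wq_entry *e)
{
    uint32_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
    if (h == atomic_load_explicit(&q->tail, memory_order_acquire))
        return -1;                         /* queue empty */
    *e = q->slots[h % QSLOTS];
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return 0;
}

int main(void)
{
    static struct wq q;                    /* would be mmap'ed in practice */
    wq_post(&q, (struct wq_entry){ .addr = 0x1000, .len = 256 });
    struct wq_entry e;
    if (wq_poll(&q, &e) == 0)
        printf("posted op: addr=0x%llx len=%llu\n",
               (unsigned long long)e.addr, (unsigned long long)e.len);
    return 0;
}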
“…Alternatively, there are manycore tiled processors with lean per-tile NIs directly integrated into the core's pipeline [7]. While per-tile designs optimize for low latency, they primarily target fine-grain (e.g., scalar) communication and are not suitable for in-memory rack-scale computing with coarse-grain objects from hundreds of bytes to tens of kilobytes.…”
Section: Introduction
confidence: 99%