Proceedings of the 42nd Annual International Symposium on Computer Architecture 2015
DOI: 10.1145/2749469.2750415

Manycore network interfaces for in-memory rack-scale computing

Abstract: Datacenter operators rely on low-cost, high-density technologies to maximize throughput for data-intensive services with tight tail latencies. In-memory rack-scale computing is emerging as a promising paradigm in scale-out datacenters, capitalizing on commodity SoCs, low-latency and high-bandwidth communication fabrics, and a remote memory access model to enable aggregation of a rack's memory for critical data-intensive applications such as graph processing or key-value stores. Low latency and high bandwidth not …

Cited by 22 publications (14 citation statements)
References 37 publications
“…Another drawback is that it is currently only possible for such protocols to work with devices and device drivers that explicitly support them. A proposed approach for overcoming the protocol translation overhead would be to integrate network interface functionality directly into SoCs [7], but the improvement only takes effect when the SoCs are in communication with each other. This idea is followed in the rack-scale architecture [6], which generalizes a trend returning from switched cluster architectures to hypercube architectures [11,32].…”
Section: Distributed I/O Using RDMA
confidence: 99%
“…EMC/Isilon [23]) solutions to clients connected via a conventional network. AppliedMicro's X-Gene2 server SoC [49] and Oracle's Sonoma [34] integrate the RDMA controller directly on chip, HP Moonshot [36] combines low-power processors with RDMA NICs, and research proposals further argue for on-chip support for one-sided remote access primitives [17,56]. The benefit of such rack-scale memory pooling approaches is that building larger logical entities comes at a lower cost and complexity as compared to the cache-coherent NUMA (ccNUMA) approach.…”
Section: Architectural Building Blocks
confidence: 99%
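
The one-sided remote access primitives in the excerpt above let a node read or write another node's memory without involving the remote CPU, which is what makes rack-scale memory pooling cheaper and simpler than ccNUMA. Below is a minimal sketch of the addressing idea only; the names (RackMemoryPool, remote_read, remote_write) are hypothetical illustrations, not the API of soNUMA, FaRM, or RDMA verbs, all of which implement this path in NIC hardware rather than software.

    # Toy model of a rack-wide pooled address space served by one-sided
    # accesses. All names are hypothetical; real systems do this in the NIC.

    class RackMemoryPool:
        """Aggregates each node's local memory into one flat global address space."""

        def __init__(self, nodes: int, bytes_per_node: int):
            self.bytes_per_node = bytes_per_node
            self.memory = [bytearray(bytes_per_node) for _ in range(nodes)]

        def _locate(self, global_addr: int):
            # A global address decomposes into (target node, local offset).
            return divmod(global_addr, self.bytes_per_node)

        def remote_read(self, global_addr: int, length: int) -> bytes:
            # One-sided read: the target node's CPU is never interrupted.
            node, offset = self._locate(global_addr)
            return bytes(self.memory[node][offset:offset + length])

        def remote_write(self, global_addr: int, data: bytes) -> None:
            # One-sided write into the owning node's memory.
            node, offset = self._locate(global_addr)
            self.memory[node][offset:offset + len(data)] = data

    pool = RackMemoryPool(nodes=16, bytes_per_node=1 << 20)
    addr = 5 * (1 << 20) + 64            # byte 64 of node 5's region
    pool.remote_write(addr, b"hello")
    assert pool.remote_read(addr, 5) == b"hello"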
“…• We implement RackOut KVS (RO-KVS), a proof-of-concept KVS using a conventional network for client access and an RDMA fabric for memory access. RO-KVS is based on FaRM [22] and is ported to both Mellanox RDMA [52] and Scale-Out NUMA [17,56]. We evaluate RO-KVS using RackOut_static scheduling in terms of its 99th percentile tail latency for the hottest rack of a 512-server deployment.…”
Section: Introduction
confidence: 99%
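
The 99th-percentile tail latency used as the evaluation metric in the excerpt above is computed from the sorted distribution of per-request latencies. A minimal sketch using the nearest-rank method; the exponential workload and its parameters are invented for illustration.

    import math
    import random

    def tail_latency(samples, percentile=99.0):
        """Nearest-rank percentile: smallest value >= the given fraction of samples."""
        ordered = sorted(samples)
        rank = math.ceil(percentile / 100.0 * len(ordered))
        return ordered[rank - 1]

    # Synthetic per-request latencies in microseconds: 10 us mean, heavy tail.
    random.seed(42)
    samples = [random.expovariate(1 / 10.0) for _ in range(100_000)]

    mean = sum(samples) / len(samples)
    print(f"mean = {mean:.1f} us, p99 = {tail_latency(samples):.1f} us")
    # The p99 comes out several times the mean, which is why tail latency,
    # not average latency, is the binding constraint for datacenter services.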
“…In-memory processing and the use of remote direct memory access as the underlying communications system is a growing trend in large-scale computing. Architectures such as scale-out non-uniform memory access (NUMA) [30] for rack-scale computers are very sensitive to latency and thus have latency-reducing designs [31]. However, they have limited scalability due to intrinsic physical limitations of the propagation delay among different elements of the system.…”
Section: Limitations Of Current-day Architectures
confidence: 99%
“…A fibre used for inter-server connection has a propagation delay of 5 ns/m; thus, within a standard-height rack, the propagation delay between the top and bottom rack units is approximately 9 ns, and the round-trip time to fetch remote data is 18 ns. While for current-generation architectures this order of latency is reasonable [31], it indicates scale-out NUMA machines at data-centre scale (with each round trip taking at least 1 μs) are not plausible, as the round-trip latency alone is many times the time-scale for memory retrieval from local random access memory or the latency contribution of any other element in the system. With latencies aggressively reduced across all other elements of in-memory architectures, such propagation delays set a limit on the physical size and thus the scalability of such an architecture.…”
Section: Limitations Of Current-day Architectures
confidence: 99%
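
The latency budget in the excerpt above follows directly from the 5 ns/m fibre delay it quotes. A quick check of that arithmetic; the 1.8 m rack span and 100 m datacentre cable run are assumed distances for illustration, not figures from the cited paper.

    FIBRE_DELAY_NS_PER_M = 5.0   # propagation delay in fibre, from the text

    def round_trip_ns(distance_m: float) -> float:
        """Round-trip propagation delay over a fibre run of the given length."""
        return 2.0 * distance_m * FIBRE_DELAY_NS_PER_M

    rack_span_m = 1.8        # assumed top-to-bottom cabling distance in a 42U rack
    datacentre_span_m = 100  # assumed cross-datacentre cable run

    print(f"in-rack one way:       {rack_span_m * FIBRE_DELAY_NS_PER_M:.0f} ns")     # ~9 ns
    print(f"in-rack round trip:    {round_trip_ns(rack_span_m):.0f} ns")             # ~18 ns
    print(f"datacentre round trip: {round_trip_ns(datacentre_span_m) / 1e3:.1f} us") # ~1 us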