Proceedings of the Workshop on Memory Systems Performance and Correctness 2014
DOI: 10.1145/2618128.2618129

Main memory and cache performance of Intel Sandy Bridge and AMD Bulldozer

Abstract: Application performance on multicore processors is seldom constrained by the speed of floating point or integer units. Much more often, limitations are caused by the memory subsystem, particularly shared resources such as last level caches or memory controllers. Measuring, predicting and modeling memory performance becomes a steeper challenge with each new processor generation due to the growing complexity and core count. We tackle the important aspect of measuring and understanding undocumented memory perform…

Cited by 48 publications (35 citation statements). References 14 publications.

“…While faster than address space switches, cross-core invocations are still expensive. A minimal call/reply invocation requires four cache-line transactions [72], each taking 109-400 cycles [21,70,71] depending on whether the line is transferred between cores of the same socket or over a cross-socket link. Hence the whole call/reply invocation takes 448-1988 cycles [72].…”
Section: Isolation Mechanisms and Overheads
Citation type: mentioning (confidence: 99%)
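
The per-line transfer cost quoted above is exactly the kind of number that is absent from vendor documentation and has to be measured. The sketch below is an illustration only, not the benchmark from the cited works: two threads bounce a single cache line back and forth with an atomic ping-pong, which gives a rough per-transfer latency. Explicit core pinning (e.g. taskset or pthread_setaffinity_np) and the iteration count are assumptions; results differ strongly between same-socket and cross-socket core pairs.

/* Minimal cache-line ping-pong sketch (illustrative, not the cited benchmark).
 * Compile with: cc -O2 -pthread pingpong.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ITERS 1000000

/* A single cache-line-aligned flag so both threads contend on exactly one line. */
static _Alignas(64) atomic_int flag = 0;

static void *pong(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        while (atomic_load_explicit(&flag, memory_order_acquire) != 1)
            ;                                            /* wait for ping */
        atomic_store_explicit(&flag, 0, memory_order_release);  /* pong */
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    struct timespec a, b;

    pthread_create(&t, NULL, pong, NULL);
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (int i = 0; i < ITERS; i++) {
        atomic_store_explicit(&flag, 1, memory_order_release);  /* ping */
        while (atomic_load_explicit(&flag, memory_order_acquire) != 0)
            ;                                            /* wait for pong */
    }
    clock_gettime(CLOCK_MONOTONIC, &b);
    pthread_join(t, NULL);

    double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
    /* One iteration is a full round trip, i.e. roughly two line transfers. */
    printf("%.1f ns per round trip, ~%.1f ns per line transfer\n",
           ns / ITERS, ns / ITERS / 2);
    return 0;
}

Pinning the two threads first to cores on the same socket and then to cores on different sockets reproduces the same-socket versus cross-socket gap the citation refers to.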
“…This structure is loaded into memory by the master thread (by default, Linux places this data on its local memory bank). Therefore, as the number of threads increases, the memory bank that allocates […] Allocating data in an interleaved way does not reduce remote accesses, but it guarantees a fair share of them across all memory banks and therefore prevents access contention, a phenomenon these architectures are especially prone to because of the reduced memory bandwidth between NUMA nodes [21]. This explains why an allocation policy that reinforces locality between processors and memory banks does not provide good results.…”
Section: Analysis of Memory Allocation Policies
Citation type: mentioning (confidence: 99%)
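
As a hedged illustration of the interleaved policy mentioned above (the table, its size, and the use of libnuma are assumptions, not the citing paper's actual code), page placement can be requested explicitly with libnuma; the same effect is usually available without code changes by launching the program under numactl --interleave=all.

/* Interleaved NUMA allocation sketch; link with -lnuma. */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma: NUMA not available on this system\n");
        return 1;
    }

    size_t size = 1UL << 30;          /* 1 GiB shared table (illustrative) */

    /* Pages are placed round-robin across all NUMA nodes, so every node serves
     * a fair share of the remote accesses instead of the master thread's node
     * (the first-touch default) absorbing them all. */
    double *table = numa_alloc_interleaved(size);
    if (table == NULL) {
        perror("numa_alloc_interleaved");
        return 1;
    }

    memset(table, 0, size);           /* touching pages does not change placement */

    /* ... parallel threads read and write `table` here ... */

    numa_free(table, size);
    return 0;
}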
“…Different RDMA interfaces, such as OFED [1], uGNI/DMAPP [2], Portals 4 [3], or FlexNIC [4], provide varying levels of support for steering the data at the receiver. Yet, with upcoming terabits-per-second networks [5], we foresee a new bottleneck when it comes to processing the delivered data: a modern CPU requires 10-15 ns to access its L3 cache (Haswell: 34 cycles, Skylake: 44 cycles [6,7]). However, a 400 Gib/s NIC can deliver a 64-byte message every 1.2 ns.…”
Section: Motivation
Citation type: mentioning (confidence: 99%)
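
The mismatch quoted above can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes a 2.5 GHz core clock and counts only payload bits at line rate (no headers or framing); both simplifications are assumptions, not taken from the cited paper.

/* Bandwidth vs. L3-latency arithmetic sketch (illustrative assumptions). */
#include <stdio.h>

int main(void)
{
    double link_bits_per_s = 400e9;          /* ~400 Gb/s NIC line rate */
    double msg_bits        = 64 * 8;         /* one 64-byte message */
    double clock_ghz       = 2.5;            /* assumed core frequency */

    double msg_interval_ns = msg_bits / link_bits_per_s * 1e9;
    double l3_haswell_ns   = 34 / clock_ghz; /* 34 cycles on Haswell */
    double l3_skylake_ns   = 44 / clock_ghz; /* 44 cycles on Skylake */

    printf("new 64-B message every   %.2f ns\n", msg_interval_ns);
    printf("one L3 access, Haswell:  %.2f ns\n", l3_haswell_ns);
    printf("one L3 access, Skylake:  %.2f ns\n", l3_skylake_ns);
    /* A single L3 hit already spans roughly ten message arrivals, so
     * per-message processing cannot afford even one shared-cache access
     * per message at line rate. */
    return 0;
}

With these assumptions a message arrives roughly every 1.3 ns while one L3 access costs 13-18 ns, which is the order-of-magnitude gap the citation points at.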