Proceedings of the Workshop on Memory Systems Performance and Correctness 2014
DOI: 10.1145/2618128.2618129

Main memory and cache performance of Intel Sandy Bridge and AMD Bulldozer

Abstract: Application performance on multicore processors is seldom constrained by the speed of floating point or integer units. Much more often, limitations are caused by the memory subsystem, particularly shared resources such as last level caches or memory controllers. Measuring, predicting and modeling memory performance becomes a steeper challenge with each new processor generation due to the growing complexity and core count. We tackle the important aspect of measuring and understanding undocumented memory perform…

Cited by 48 publications (35 citation statements). References 14 publications.

“…While faster than address space switches, cross-core invocations are still expensive. A minimal call/reply invocation requires four cache-line transactions [72], each taking 109-400 cycles [21,70,71] depending on whether the line is transferred between cores of the same socket or over a cross-socket link. Hence the whole call/reply invocation takes 448-1988 cycles [72].…”
Section: Isolation Mechanisms and Overheads
Citation type: mentioning (confidence: 99%)
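
The per-line transfer cost quoted above is exactly the kind of number that is absent from vendor documentation and has to be measured. The sketch below is an illustration only, not the benchmark from the cited works: two threads bounce a single cache line back and forth with an atomic ping-pong, which gives a rough per-transfer latency. Explicit core pinning (e.g. taskset or pthread_setaffinity_np) and the iteration count are assumptions; results differ strongly between same-socket and cross-socket core pairs.

/* Minimal cache-line ping-pong sketch (illustrative, not the cited benchmark).
 * Compile with: cc -O2 -pthread pingpong.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ITERS 1000000

/* A single cache-line-aligned flag so both threads contend on exactly one line. */
static _Alignas(64) atomic_int flag = 0;

static void *pong(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        while (atomic_load_explicit(&flag, memory_order_acquire) != 1)
            ;                                            /* wait for ping */
        atomic_store_explicit(&flag, 0, memory_order_release);  /* pong */
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    struct timespec a, b;

    pthread_create(&t, NULL, pong, NULL);
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (int i = 0; i < ITERS; i++) {
        atomic_store_explicit(&flag, 1, memory_order_release);  /* ping */
        while (atomic_load_explicit(&flag, memory_order_acquire) != 0)
            ;                                            /* wait for pong */
    }
    clock_gettime(CLOCK_MONOTONIC, &b);
    pthread_join(t, NULL);

    double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
    /* One iteration is a full round trip, i.e. roughly two line transfers. */
    printf("%.1f ns per round trip, ~%.1f ns per line transfer\n",
           ns / ITERS, ns / ITERS / 2);
    return 0;
}

Pinning the two threads first to cores on the same socket and then to cores on different sockets reproduces the same-socket versus cross-socket gap the citation refers to.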
“…This structure is loaded into memory by the master thread (by default, Linux places this data on its local memory bank). Therefore, as the number of threads increases, the memory bank that allocates […] Allocating data in an interleaved way does not reduce remote accesses, but it guarantees a fair share of them across all memory banks and therefore prevents access contention, a phenomenon these architectures are especially prone to because of the reduced memory bandwidth between NUMA nodes [21]. This explains why an allocation policy that reinforces locality between processors and memory banks does not provide good results.…”
Section: Analysis of Memory Allocation Policies
Citation type: mentioning (confidence: 99%)
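
As a hedged illustration of the interleaved policy mentioned above (the table, its size, and the use of libnuma are assumptions, not the citing paper's actual code), page placement can be requested explicitly with libnuma; the same effect is usually available without code changes by launching the program under numactl --interleave=all.

/* Interleaved NUMA allocation sketch; link with -lnuma. */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma: NUMA not available on this system\n");
        return 1;
    }

    size_t size = 1UL << 30;          /* 1 GiB shared table (illustrative) */

    /* Pages are placed round-robin across all NUMA nodes, so every node serves
     * a fair share of the remote accesses instead of the master thread's node
     * (the first-touch default) absorbing them all. */
    double *table = numa_alloc_interleaved(size);
    if (table == NULL) {
        perror("numa_alloc_interleaved");
        return 1;
    }

    memset(table, 0, size);           /* touching pages does not change placement */

    /* ... parallel threads read and write `table` here ... */

    numa_free(table, size);
    return 0;
}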
“…Different RDMA interfaces, such as OFED [1], uGNI/DMAPP [2], Portals 4 [3], or FlexNIC [4], provide varying levels of support for steering the data at the receiver. Yet, with upcoming terabits-per-second networks [5], we foresee a new bottleneck when it comes to processing the delivered data: a modern CPU requires 10-15 ns to access its L3 cache (Haswell: 34 cycles, Skylake: 44 cycles [6,7]). However, a 400 Gib/s NIC can deliver a 64-byte message every 1.2 ns.…”
Section: Motivation
Citation type: mentioning (confidence: 99%)
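
The mismatch quoted above can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes a 2.5 GHz core clock and counts only payload bits at line rate (no headers or framing); both simplifications are assumptions, not taken from the cited paper.

/* Bandwidth vs. L3-latency arithmetic sketch (illustrative assumptions). */
#include <stdio.h>

int main(void)
{
    double link_bits_per_s = 400e9;          /* ~400 Gb/s NIC line rate */
    double msg_bits        = 64 * 8;         /* one 64-byte message */
    double clock_ghz       = 2.5;            /* assumed core frequency */

    double msg_interval_ns = msg_bits / link_bits_per_s * 1e9;
    double l3_haswell_ns   = 34 / clock_ghz; /* 34 cycles on Haswell */
    double l3_skylake_ns   = 44 / clock_ghz; /* 44 cycles on Skylake */

    printf("new 64-B message every   %.2f ns\n", msg_interval_ns);
    printf("one L3 access, Haswell:  %.2f ns\n", l3_haswell_ns);
    printf("one L3 access, Skylake:  %.2f ns\n", l3_skylake_ns);
    /* A single L3 hit already spans roughly ten message arrivals, so
     * per-message processing cannot afford even one shared-cache access
     * per message at line rate. */
    return 0;
}

With these assumptions a message arrives roughly every 1.3 ns while one L3 access costs 13-18 ns, which is the order-of-magnitude gap the citation points at.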