Comparing cache architectures and coherency protocols on x86-64 multicore SMP systems

Hackenberg, Daniel; Molka, Daniel; Nagel, Wolfgang E.

doi:10.1145/1669112.1669165

Cited by 97 publications

(56 citation statements)

References 5 publications

“…As multicore, multichip servers are becoming widely used, especially as the number of processor packages increases, it is becoming necessary to revisit the impact of NUMA on the modern CMPs for some emerging workloads. Some recent work has measured NUMA-related performance in the state-of-the-art multicores using carefully designed synthetic benchmarks [11,26]. On the other hand, there is a wealth of research related to alleviating contention in memory subsystems including cache and bandwidth on current multicores [7,10,15,25,29,32,33,34,35].…”

Section: Related Work (mentioning)

confidence: 99%

Optimizing Google's warehouse scale computers: The NUMA experience

Tang, Mars, Xiao, et al. 2013

2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA)

“…For each data point, the two threads execute in lock step as shown in Figure 5 (similar measurements have been used in existing systems research [18,30,40,73]). Thread y brings the data in a modified state in its local caches and then thread x measures the latency of its own access to the shared data using the timestamp counter of the core [4].…”

Section: Context-to-context Latencies (mentioning)

confidence: 99%

Abstracting Multi-Core Topologies with MCTOP

Chatzopoulos, Guerraoui, Harris, et al. 2017

Proceedings of the Twelfth European Conference on Computer Systems

Portability and efficiency are usually antagonists in multicore computing. In order to develop efficient code, one needs to take into account the topology of the target multi-cores (e.g., for locality). This clearly hampers code portability. In this paper, we show that you can have the cake and eat it too. We introduce MCTOP, an abstraction of multi-core topologies augmented with important low-level hardware information, such as memory bandwidths and communication latencies. We show how to automatically generate MCTOP using libmctop, our library that leverages the determinism of cache-coherence protocols to infer the topology of multi-cores using only latency measurements. MCTOP enables developers to accurately and portably define high-level performance optimization policies. We illustrate several such policies through four examples: (i-ii) thread placement in OpenMP and in a MapReduce library, (iii) a topology-aware mergesort algorithm, as well as (iv) automatic backoff schemes for locks. We illustrate the portability of these optimizations on five processors from Intel, AMD, and Oracle, with low effort.
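
The lock-step measurement described in the citation statement above (one thread brings a cache line into the modified state, the other then times its own access with the timestamp counter) can be sketched roughly as follows. This is an illustrative sketch only, not code from MCTOP or from Hackenberg et al. The names `read_ticks`, `pingpong_min_latency`, `ROUNDS`, `shared_line`, and `turn` are hypothetical. The sketch assumes a C11 compiler with pthreads, omits the thread pinning a real measurement needs, and reads the counter without serializing instructions, so the result includes timer overhead.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
#endif

enum { ROUNDS = 200 };  /* hypothetical number of lock-step rounds */

/* Each variable gets its own 64-byte line (assumed cache-line size). */
static _Alignas(64) atomic_uint_fast64_t shared_line;
static _Alignas(64) atomic_int turn;  /* 0: writer's turn, 1: reader's turn */

/* Timestamp counter on x86; monotonic clock in ns elsewhere (assumption). */
static inline uint64_t read_ticks(void) {
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

/* "Thread y": writes the line so it is modified in its local cache. */
static void *writer(void *arg) {
    (void)arg;
    for (int i = 0; i < ROUNDS; i++) {
        while (atomic_load_explicit(&turn, memory_order_acquire) != 0)
            sched_yield();
        atomic_store_explicit(&shared_line, (uint64_t)i + 1,
                              memory_order_relaxed);
        atomic_store_explicit(&turn, 1, memory_order_release);
    }
    return NULL;
}

/* "Thread x": times its own load of the line after each write.
 * Returns the smallest observed tick delta; *checksum sums the values read. */
uint64_t pingpong_min_latency(uint64_t *checksum) {
    pthread_t t;
    uint64_t best = UINT64_MAX;
    *checksum = 0;
    atomic_store(&turn, 0);
    if (pthread_create(&t, NULL, writer, NULL) != 0)
        return UINT64_MAX;
    for (int i = 0; i < ROUNDS; i++) {
        while (atomic_load_explicit(&turn, memory_order_acquire) != 1)
            sched_yield();
        uint64_t t0 = read_ticks();
        uint64_t v = atomic_load_explicit(&shared_line, memory_order_relaxed);
        uint64_t t1 = read_ticks();
        assert(v == (uint64_t)i + 1);  /* lock step: this round's value */
        *checksum += v;
        if (t1 - t0 < best)
            best = t1 - t0;
        atomic_store_explicit(&turn, 0, memory_order_release);
    }
    pthread_join(t, NULL);
    return best;
}
```

Calling `pingpong_min_latency(&sum)` from a `main` function and printing the result gives a rough lower bound in ticks. Without pinning both threads to known cores, the number cannot be attributed to a particular core pair or level of the cache hierarchy.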

“…Peng et al [33][34] analyze the memory hierarchy of early dual-core processors from Intel and AMD and demonstrate their respective characteristics. In [28], Hackenberg et al conduct a comprehensive investigation on the cache structures on advanced quad-core multiprocessors. In recent years, comparison between general purpose GPUs is becoming a promising topic.…”

Section: Related Work (mentioning)

confidence: 99%

Untitled

2014

IJCSIT
