Featherlight on-the-fly false-sharing detection

Chabbi, Milind; Wen, Shasha; Liu, Xu

doi:10.1145/3178487.3178499

Cited by 19 publications

(22 citation statements)

References 30 publications

“…Second, NumaPerf detects more performance issues than the combination of all existing NUMA profilers [9,13,17,23,24,27,29,32]. The performance issues that cannot be detected by existing NUMA profilers are highlighted with a checkmark in the last column of the table, although some can be detected by other tools (but not NUMA tools), such as cache false/true sharing issues [7,12,20–22]. This comparison with existing NUMA profilers is based on the methodology, instead of based on the results of specific tools.…”

Section: Effectiveness (mentioning)

confidence: 99%

NumaPerf

Zhao, Zhou, Guan, et al. 2021

Proceedings of the ACM International Conference on Supercomputing

Self Cite

It is extremely challenging to achieve optimal performance of parallel applications on a NUMA architecture, which necessitates the assistance of profiling tools. However, existing NUMA-profiling tools share some similar shortcomings, such as portability, effectiveness, and helpfulness issues. This paper proposes a novel profiling tool, NumaPerf, that overcomes these issues. NumaPerf aims to identify potential performance issues for any NUMA architecture, instead of only on the current hardware. To achieve this, NumaPerf focuses on memory sharing patterns between threads, instead of real remote accesses. NumaPerf further detects potential thread migrations and load imbalance issues that could significantly affect the performance but are omitted by existing profilers. NumaPerf also identifies cache coherence issues separately that may require different fix strategies. Based on our extensive evaluation, NumaPerf can identify more performance issues than any existing tool, while fixing them leads to significant performance speedup.

“…It aims for identifying the performance issues for the hybrid DRAM-HBM architecture, but not the NUMA architecture, and has a higher overhead than NumaPerf. Some tools focus on the detection of false/true sharing issues [7,12,20–22], but skipping other NUMA issues.…”

Section: Other Related Tools (mentioning)

confidence: 99%

“…Even though it has no runtime overhead, it cannot capture all the program objects or their references as it relies on static analysis. Chabbi et al [7] employ PMUs and debug registers to detect false sharing but do not generalize it for inter-thread communication matrices; furthermore, their technique does not quantify communication volume even for false sharing. Even though these tools can count memory access events, they do not associate these events to threads and are not used in generating communication pattern among threads.…”

Section: Related Work (mentioning)

confidence: 99%

“…Inter-thread communication is an important performance indicator in shared-memory multi-core systems [38]. Thread communication information offers valuable insights: it divulges, to an extent, the inner workings of the program without having to examine the code meticulously; it can be used for identifying possible sources of communication-related performance overhead in parallel applications [7,33]; it can also be used for verifying the multicore hardware design. Therefore, identifying which groups of threads communicate in what volume and their quantitative comparison against expectations offer avenues to tune software for high performance.…”

Section: Introduction (mentioning)

confidence: 99%

ComDetective

Sasongko, Chabbi, Akhtar, et al. 2019

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Self Cite

Inter-thread communication is a vital performance indicator in shared-memory systems. Prior works on identifying inter-thread communication employed hardware simulators or binary instrumentation and suffered from inaccuracy or high overheads (both space and time), making them impractical for production use. We propose ComDetective, which produces communication matrices that are accurate and introduces low runtime and low memory overheads, thus making it practical for production use. ComDetective employs hardware performance counters to sample memory-access events and uses hardware debug registers to sample communicating pairs of threads. ComDetective can differentiate communication as true or false sharing between threads. Its runtime and memory overheads are only 1.30× and 1.27×, respectively, for the 18 applications studied under 500K sampling period. Using ComDetective, we produce insightful communication matrices for microbenchmarks, PARSEC benchmark suite, and several CORAL applications and compare the generated matrices against MPI counterparts. Guided by ComDetective, we optimize a few codes and achieve up to 13% speedup. CCS Concepts: General and reference → Performance; Software and its engineering → Multithreading; Computer systems organization → Multicore architectures.
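
The true versus false sharing distinction mentioned in this abstract can be illustrated with a small sketch. This is not ComDetective's implementation: the function names, the 64-byte cache line size, and the use of plain (address, size) pairs are all assumptions made for illustration. Given two memory accesses made by different threads, with at least one of them a write, the accesses are classified by whether they touch the same cache line and whether their byte ranges overlap.

```python
CACHE_LINE_BYTES = 64  # assumption: 64-byte lines, common on x86-64 but not universal


def lines_touched(addr, size):
    """Return the set of cache-line indices covered by [addr, addr + size)."""
    first = addr // CACHE_LINE_BYTES
    last = (addr + size - 1) // CACHE_LINE_BYTES
    return set(range(first, last + 1))


def classify_sharing(addr_a, size_a, addr_b, size_b):
    """Classify two accesses from different threads (hypothetical helper).

    Returns "none" if they share no cache line, "true" if their byte
    ranges overlap, and "false" if they only share a cache line.
    """
    if not lines_touched(addr_a, size_a) & lines_touched(addr_b, size_b):
        return "none"
    bytes_overlap = addr_a < addr_b + size_b and addr_b < addr_a + size_a
    return "true" if bytes_overlap else "false"


if __name__ == "__main__":
    # Two adjacent 8-byte fields on the same line: same line, disjoint bytes.
    print(classify_sharing(0x1000, 8, 0x1008, 8))
    # A 4-byte access inside the first field: the bytes overlap.
    print(classify_sharing(0x1000, 8, 0x1004, 4))
    # An access one full line away: no shared line at all.
    print(classify_sharing(0x1000, 8, 0x1040, 8))
```

A sampling-based tool would apply a check of this kind only to the sampled access pairs it observes, rather than to every access in the program.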

“…Many applications use calling context to attain better understanding of program behavior. Indeed, the ability to inspect call stack context is an essential part of a wide variety of tools for debugging [4,9,12,13,17,18,24,31], testing [6,20,33], and analyzing [2,11,28,29,38,41,42] modern software.…”

Section: Introduction (mentioning)

confidence: 99%

Valence: variable length calling context encoding

Zhou, Jantz, Kulkarni, et al. 2019

Proceedings of the 28th International Conference on Compiler Construction

Many applications, including program optimizations, debugging tools, and event loggers, rely on calling context to gain additional insight about how a program behaves during execution. One common strategy for determining calling contexts is to use compiler instrumentation at each function call site and return sites to encode the call paths and store them in a designated area of memory. While recent works have shown that this approach can generate precise calling context encodings with low overhead, the encodings can grow to hundreds or even thousands of bytes to encode a long call path, for some applications. Such lengthy encodings increase the costs associated with storing, detecting, and decoding call path contexts, and can limit the effectiveness of this approach for many usage scenarios. This work introduces a new compiler-based strategy that significantly reduces the length of calling context encoding with little or no impact on instrumentation costs for many applications. Rather than update or store an entire word at each function call and return, our approach leverages static analysis and variable length instrumentation to record each piece of the calling context using only a small number of bits, in most cases. We implemented our approach as an LLVM compiler pass, and compared it directly to the state-of-the-art calling context encoding strategy (PCCE) using a standard set of C/C++ applications from SPEC CPU® 2017. Overall, our approach reduces the length of calling context encoding from 4.3 words to 1.6 words on average (>60% reduction), thereby improving the efficiency of applications that frequently store or query calling contexts.
