Non-determinism and overcount on modern hardware performance counter implementations

Weaver, Vincent M.; Terpstra, Dan; Moore, Shirley

doi:10.1109/ispass.2013.6557172

Cited by 103 publications

(76 citation statements)

References 17 publications

“…Real-world hardware counters usually do not live up to the ideal ones. The undesired deviation from the expected result is usually due to non-determinism (different values for identical runs) and overcount (counting some instructions multiple times) [14]. There are various external sources of these variations including program layout, measurement overhead, multi-processor variations and uncertainty of compilers that may leave many unexplored corner cases.…”

Section: Related Work (mentioning)

confidence: 99%

A Methodology for Profiling and Partitioning Stream Programs on Many-core Architectures

Michalska

Boutellier

Mattavelli

2015

Procedia Computer Science

Maximizing the data throughput is a very common implementation objective for several streaming applications. Such task is particularly challenging for implementations based on many-core and multi-core target platforms because, in general, it implies tackling several NP-complete combinatorial problems. Moreover, an efficient design space exploration requires an accurate evaluation on the basis of dataflow program execution profiling. The focus of the paper is on the methodology challenges for obtaining accurate profiling measures. Experimental results validate a many-core platform built by an array of Transport Triggered Architecture processors for exploring the partitioning search space based on the execution trace analysis.

“…This is because performance counters report slightly different number of instructions even for identical instruction sequences, as reported in [39]. However, our framework is unaffected by this issue.…”

Section: Scheduler (mentioning)

confidence: 93%

What is the cost of weak determinism?

Segulja

Abdelrahman

2014

Proceedings of the 23rd International Conference on Parallel Architectures and Compilation

We analyze the fundamental performance impact of enforcing a fixed order of synchronization operations to achieve weak deterministic execution. Our analysis is in three parts, performed on a real system using the SPLASH-2 and PARSEC benchmarks. First, we quantify the impact of various sources of nondeterminism on execution of data-race-free programs. We find that thread synchronization is the prevalent source of nondeterminism, sometimes affecting program output. Second, we divorce the implementation overhead of a system imposing a specific synchronization order from the impact of enforcing this order. We show that this fundamental cost of determinism is small (slowdown of 4% on average and 32% in the worst case) and we identify application characteristics responsible for this cost. Finally, we evaluate this cost under perturbed execution conditions. We find that demanding determinism when threads face such conditions can cause almost 2x slowdown.

“…Hardware counters are limited on mobile devices. For example, L2 memory counters are not available on many ARM processors [Weaver et al 2013]. This prevents porting analytical methodologies which are relying on hardware performance counters.…”

Section: Methodology Restrictions (mentioning)

confidence: 99%

Impact of GC design on power and performance for Android

Hussein

Payer

Hosking

et al.

2015

Proceedings of the 8th ACM International Systems and Storage Conference

Small mobile devices have evolved to versatile computing systems. Android devices run a complete software stack, including a full Linux kernel, user land with several software daemons and a virtual machine to run applications. On these mobile systems energy is a scarce resource and needs to be handled carefully. Current systems rely on governors that adjust the frequency of individual cores depending on the system load. We measure energy consumption of different components of this complex software stack, including garbage collection (GC) of the Android virtual machine. Here we propose several extensions to the default GC configuration of Android, including a generational collector, across established dimensions of heap memory size and concurrency. Our evaluation shows that Android's asynchronous GC thread consumes a significant amount of energy. Therefore, varying the GC strategy can reduce total on-chip energy (by 20-30%) whilst slightly impacting application throughput (by 10-40%) and increasing worst-case pause times (by 20-30%). Our work quantifies the direct impact of GC on mobile system, enumerates the main factors and layers of this relationship, and offers a guide for analysis of memory behavior in understanding and tuning system performance.
