Heterogeneous multi-processors are designed to bridge the gap between performance and energy efficiency in modern embedded systems. This is achieved by pairing Out-of-Order (OoO) cores, which yield performance through aggressive speculation and latency masking, with In-Order (InO) cores, which preserve energy through simpler design. By leveraging migrations between them, workloads can select the best setting for any given energy/delay envelope. However, migrations introduce execution overheads that can hurt performance if they happen too frequently. Finding the optimal migration frequency is critical to maximizing energy savings while maintaining acceptable performance. We develop a simulation methodology that can 1) isolate the hardware effects of migrations from the software, 2) directly compare the performance of different core types, 3) quantify the performance degradation and 4) calculate the cost of migrations for each case. To showcase our methodology we run mibench, a microbenchmark suite, and show that migrations can happen as frequently as every 100k instructions with little performance loss. We also show that, contrary to numerous recent studies, hypothetical designs do not need to share all of their internal components to be able to migrate at that frequency. Instead, we propose a feasible system that shares level 2 caches and a translation lookaside buffer that matches performance and efficiency. Our results show that in up to 10% of phases a migration to the OoO core leads to performance benefits without any additional energy cost compared to running on the InO core, and that in up to 6% of phases a migration to the InO core can save energy without affecting performance.
When considering a policy that focuses on improving the energy-delay product, results show that on average 66% of the phases can be migrated to deliver equal or better system operation without having to aggressively share the entire memory system or to revert to migration periods finer than 100k instructions.

Fig. 1. An excerpt of execution of dijkstra running on two different core types. Sampling at a higher resolution reveals phases that can potentially be exploited to increase efficiency. (Axis label: Instructions per cycle.)
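To make the energy-delay-product (EDP) policy mentioned above concrete, the following is a minimal Python sketch of a per-phase decision. It is not the methodology used in this work: the names `PhaseSample`, `edp`, and `choose_core`, the per-phase energy and delay inputs, and the way the migration cost is charged are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PhaseSample:
    """Hypothetical measurements of one phase on one core type."""
    energy_j: float  # energy consumed during the phase, in joules
    delay_s: float   # execution time of the phase, in seconds


def edp(sample: PhaseSample) -> float:
    """Energy-delay product: energy multiplied by execution time."""
    return sample.energy_j * sample.delay_s


def choose_core(ino: PhaseSample, ooo: PhaseSample, current: str,
                migration_energy_j: float = 0.0,
                migration_delay_s: float = 0.0) -> str:
    """Pick 'InO' or 'OoO' for a phase by comparing EDP.

    Assumption: the migration overhead is added to the energy and delay
    of the core we are NOT currently on. We migrate only if the other
    core is strictly better, so ties keep the current core.
    """
    samples = {"InO": ino, "OoO": ooo}
    other = "OoO" if current == "InO" else "InO"

    stay_edp = edp(samples[current])
    move_edp = edp(PhaseSample(
        samples[other].energy_j + migration_energy_j,
        samples[other].delay_s + migration_delay_s,
    ))
    return other if move_edp < stay_edp else current


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    ino_phase = PhaseSample(energy_j=1.0, delay_s=2.0)
    ooo_phase = PhaseSample(energy_j=1.5, delay_s=1.0)
    print(choose_core(ino_phase, ooo_phase, current="InO"))
    print(choose_core(ino_phase, ooo_phase, current="InO",
                      migration_energy_j=0.5, migration_delay_s=0.5))
```

With the made-up numbers in the usage example, the phase is worth migrating to the OoO core when the migration is free, but not once a sufficiently large migration overhead is charged, which is the trade-off that bounds how fine-grained migrations can usefully be.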
INTRODUCTION

In today's embedded devices, performance is always tied to power constraints and energy efficiency. This happens because, while process technology scaling is enabling larger transistor densities, power per area has been shown to increase beyond a certain thr...