Abstract: Heterogeneous MPSoCs in which different types of cores share a baseline ISA but implement different operational accelerators combine programmability with flexible customization. They hold promise for high performance under power and area limitations. However, transparent binary execution and dynamic scheduling are hard on such platforms. The state-of-the-art approach for transparent accelerated execution is fault-and-migrate (FAM): when a thread executes an accelerating instruction unavailable on the host core, it is forcibly migrated to an accelerating core that implements the instruction natively. Unfortunately, this approach prohibits dynamic scheduling through flexible thread migration, which is essential on any asymmetric platform for the efficient utilization of heterogeneous resources. We present two distinct binary-level techniques, Dynamic Binary Rewriting (DBR) and Dynamic Binary Translation (DBT), which enable selective acceleration while preserving transparent thread execution and migration to any core in the system, at any point in time. DBR rewrites binary code to exploit any accelerating instructions available on the host core. DBT implements a fault-and-rewrite scheme, which sets up trampolines to emulation routines for those accelerating instructions that are not available on the host core. Both methods customize binary code on demand, enabling flexible migration. We evaluate the overhead of DBR and DBT against FAM on a real-hardware shared-ISA MPSoC prototype. Experiments with single-thread programs show that flexible migration is possible with manageable overhead. We measure the performance of our binary-level techniques by artificially triggering periodic thread migration between a Base core and an accelerating (ACC) core. Periodic migration, without aiming for optimized scheduling, results in an average slowdown of about 40% under DBR or about 10% under DBT, compared to FAM-driven scheduling. We also show results for a speedup-proportional dynamic scheduler, enabled by our techniques, using multi-program workloads. In this case, up to 50% faster execution times can be achieved by leveraging flexible thread migration.
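
To make the fault-and-rewrite idea behind DBT concrete, the following is a minimal sketch in C of the trapping half of the scheme: a SIGILL handler catches an accelerating instruction that the host core does not implement and dispatches it to a software emulation routine. This is an illustration under stated assumptions, not the paper's implementation; emulate_acc_insn, ACC_INSN_LEN, and the Linux/x86-64 register access are hypothetical stand-ins, and a full DBT scheme would additionally patch a trampoline to the emulation routine into the code so later executions of that site do not fault again.

    /* Minimal sketch of a fault-and-rewrite entry point, assuming a
     * Linux host. emulate_acc_insn() and ACC_INSN_LEN are hypothetical;
     * a real DBT system would also plant a trampoline at the faulting
     * site so subsequent executions bypass the trap entirely. */
    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdint.h>
    #include <string.h>
    #include <ucontext.h>

    #define ACC_INSN_LEN 4  /* assumed fixed-width accelerating insn */

    /* Hypothetical software emulation of the missing instruction:
     * decode it and update the interrupted register state in *mc. */
    static void emulate_acc_insn(mcontext_t *mc, const uint8_t *insn)
    {
        (void)mc;
        (void)insn;
    }

    static void sigill_handler(int sig, siginfo_t *si, void *uc_raw)
    {
        (void)sig;
        ucontext_t *uc = (ucontext_t *)uc_raw;
        const uint8_t *pc = (const uint8_t *)si->si_addr; /* faulting insn */

        emulate_acc_insn(&uc->uc_mcontext, pc);

    #ifdef __x86_64__
        /* Resume after the emulated instruction; the PC field name is
         * platform-specific (shown for x86-64 Linux). */
        uc->uc_mcontext.gregs[REG_RIP] += ACC_INSN_LEN;
    #endif
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = sigill_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGILL, &sa, NULL);

        /* ... run the workload: an accelerating instruction that is
         * unavailable on this core now traps into sigill_handler and
         * is emulated in place, instead of forcing a migration to an
         * accelerating core as under FAM. */
        return 0;
    }

Handling the fault on the host core, rather than migrating on it, is what keeps the thread schedulable on any core: the scheduler remains free to place or move the thread, and only pays an emulation (or, after rewriting, a trampoline) cost on cores that lack the instruction.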