The fth release of the multithreaded language Cilk uses a provably good \work-stealing" scheduling algorithm similar to the rst system, but the language has been completely redesigned and the runtime system completely reengineered. The e ciency of the new implementation was aided by a clear strategy that arose from a theoretical analysis of the scheduling algorithm: concentrate on minimizing overheads that contribute to the work, even at the expense of overheads that contribute to the critical path. Although it may seem counterintuitive to move overheads onto the critical path, this \work-rst" principle has led to a portable Cilk-5 implementation in which the typical cost of spawning a parallel thread is only between 2 and 6 times the cost of a C function call on a variety of contemporary machines. Many Cilk programs run on one processor with virtually no degradation compared to equivalent C programs. This paper describes how the work-rst principle was exploited in the design of Cilk-5's compiler and its runtime system. In particular, we present Cilk-5's novel \two-clone" compilation strategy and its Dijkstra-like mutual-exclusion protocol for implementing the ready deque in the work-stealing scheduler.