Abstract-As transistor technology continues to scale, the architecture community has experienced exponential growth in design complexity and significantly increasing implementation and verification costs. Moreover, Moore's law has led to a ubiquitous trend of an increasing number of cores on a single chip. Often, these large-core-count chips provide a shared memory abstraction via directories and coherence protocols, which have become notoriously error-prone and difficult to verify because of subtle data races and state space explosion. Although a very simple hardware shared memory implementation can be achieved by simply not allowing ad-hoc data replication and relying on remote accesses for remotely cached data (i.e., requiring no directories or coherence protocols), such remote-access-based directoryless architectures cannot take advantage of any data locality, and therefore suffer in both performance and energy.Our recently taped-out 110-core shared-memory processor, the Execution Migration Machine (EM²), establishes a new design point. On the one hand, EM² supports shared memory but does not automatically replicate data, and thus preserves the simplicity of directoryless architectures. On the other hand, it significantly improves performance and energy over remoteaccess-only designs by exploiting data locality at remote cores via fast hardware-level thread migration. In this paper, we describe the design choices made in the EM² chip as well as our choice of design methodology, and discuss how they combine to achieve design simplicity and verification efficiency. Even though EM² is a fairly large design-110 cores using a total of 357 million transistors-the entire chip design and implementation process (RTL, verification, physical design, tapeout) took only 18 manmonths.