At the beginning of the 21st century, the processor industry made a fundamental shift towards multicore architectures, both to address the diminishing returns in single-thread performance with increasing transistor counts and to overcome the severe power problems of clock-frequency scaling. Semiconductor technology trends indicate that the era of power- and energy-constrained manycore architectures has now arrived. Technology projections show that the energy consumed by data movement and communication will dominate the energy budget of future computing systems; unnecessary data movement will therefore leave significantly less of that budget available for computation.

The most popular communication model for multi-core and many-core architectures is shared memory. Threads or processes that run concurrently on different cores communicate and exchange data by accessing the same global memory locations. However, accesses to off-chip memory are slow, so processor designs employ a hierarchy of faster on-chip memories to improve the speed of memory operations. Memory hierarchies today are based on two dominant schemes: (i) multilevel coherent caches, and (ii) software-managed local memories (scratchpads). Caches manage the memory hierarchy transparently, using hardware replacement policies, and communication happens implicitly, through cache-coherence protocols that trigger data transfers between caches.
Scratchpad memories are controlled by the programmer or the runtime software, and communication happens explicitly, through programmable DMA engines that perform the data transfers.

This thesis proposes architectural support in the memory hierarchy to enable software to control data locality; we design programmable hardware primitives that allow runtime software to orchestrate communication and reduce the associated energy consumption. We demonstrate a hybrid cache/scratchpad memory hierarchy that provides unified hardware support for both implicit communication, via cache coherence, and explicit communication, via fast, virtualized inter-processor communication hardware primitives. We also introduce Epoch-based Cache Management (ECM), which allows software to assign priorities to cache lines in order to guide the cache replacement policy and, in effect, to manage locality. Moreover, we design the Explicit Bulk Prefetcher (EBP), a programmable prefetch engine that allows software to prefetch data accurately and ahead of time, in order to hide memory latency and improve cache locality. Furthermore, we propose a set of hardware primitives for Software Guided Coherence (SGC) in non-cache-coherent systems, which allow runtime software to orchestrate the fetching of the most up-to-date version of data from the appropriate cache(s) and to maintain coherence at the granularity of software objects.

We evaluate our proposed hardware primitives by comparing them against directory-based cache coherence with hardware prefetching. Our experimental results for explicit communication show that we can improve performance by 10% to 40% and, at the same time, reduce the energy consumption of on-chip communication by 35% to 70%, owing to a significant reduction in on-chip traffic, by factors of 2 to 4.
Moreover, we exploit a task-based programming system to guide the hardware, and show that our proposed hardware primitives for cache-coherent systems (ECM, EBP) improve performance by an average of 20%, inject 25% less on-chip traffic on average, and reduce the energy consumption of the memory-hierarchy components by an average of 28%. Our hardware support for non-cache-coherent systems (ECM, SGC) improves performance by an average of 14%, injects 41% less on-chip traffic on average, and reduces the energy consumption of the memory-hierarchy components by an average of 44%.