Today's server architectures are designed to serve the needs of a wide range of applications. For example, superscalar processors include complex control logic for out-of-order execution to extract instruction-level parallelism (ILP) from arbitrary programs. However, not all workloads use the features of a superscalar processor effectively. A workload with a regular execution pattern (e.g., a dense linear algebra kernel) may not need expensive ILP control logic to extract parallelism; it can instead run on a throughput-oriented architecture with thousands of simple cores, such as a GPU, with much better performance and power efficiency. On the other hand, only a limited class of data-parallel applications can exploit the high throughput such architectures provide. In practice, then, existing CPU and GPU platforms may not be the most efficient choices for the compute patterns of many applications.

For big data workloads, access to data is typically at least as important a bottleneck as computation. The memory subsystems of today's CPU architectures are optimized for workloads with reasonable data access locality. CPU cache hierarchies include caches of different sizes, which capture different degrees of access locality across applications. However, if an application exhibits little or no locality, data accesses become inefficient on these architectures.

As an example, consider graph applications that run on very large, unstructured datasets. Typically, the data of a vertex is computed or updated from the data of its neighbors. In an unstructured graph, the neighbors of a vertex are stored at memory locations that may be far apart, so traversing the neighbors of a vertex may involve one random memory access per neighbor. If the graph is large enough that it does not fit into the last-level cache (LLC), each access to a neighbor's data may require a random DRAM access, which typically costs hundreds of clock cycles. However, existing CPU architectures are not optimized for frequent random DRAM accesses. For example, each Intel Haswell Xeon core has 10 line fill buffers (LFBs), so each core can sustain at most 10 outstanding L1 cache misses at a time. Yet, by Little's law, an off-chip DRAM latency of hundreds of cycles requires hundreds of outstanding memory requests to utilize the full DRAM bandwidth available in the system [1]. It was reported that 10 or more Xeon cores were needed for various graph applications to fully utilize the available DRAM bandwidth [2]. Furthermore, because graph applications have low compute-to-memory-access ratios, these cores frequently stall while waiting for data from off-chip memory, so 10 or more superscalar cores consume high power while doing little useful work. It has been shown that custom architectures targeting such communication patterns have the potential to improve power efficiency by a factor of 50 or more compared to general-purpose CPUs.
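To make the traversal pattern above concrete, the following C sketch performs one neighbor-accumulation sweep over a graph stored in compressed sparse row (CSR) form. The types and names (csr_graph, sweep) are illustrative, not from the original text; the point is the inner read of data[col_idx[e]], which lands on an essentially arbitrary vertex and therefore, on a large unstructured graph, turns almost every neighbor access into a random DRAM access.

```c
#include <stddef.h>

/* Minimal CSR (compressed sparse row) graph; all names are illustrative. */
typedef struct {
    size_t  num_vertices;
    size_t *row_ptr;  /* offsets into col_idx, length num_vertices + 1 */
    size_t *col_idx;  /* neighbor vertex ids, length num_edges         */
    double *data;     /* one value per vertex                          */
} csr_graph;

/* One PageRank-style sweep: update each vertex from its neighbors.
 * The row_ptr/col_idx reads stream sequentially, but data[col_idx[e]]
 * jumps to an arbitrary vertex; on a graph larger than the LLC this
 * is effectively one random DRAM access per neighbor.                 */
void sweep(const csr_graph *g, double *out) {
    for (size_t v = 0; v < g->num_vertices; v++) {
        double acc = 0.0;
        for (size_t e = g->row_ptr[v]; e < g->row_ptr[v + 1]; e++) {
            acc += g->data[g->col_idx[e]];  /* random access per neighbor */
        }
        out[v] = acc;
    }
}
```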
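The bandwidth argument above follows from Little's law: the number of requests that must be in flight equals bandwidth times latency divided by request size. The sketch below works through the arithmetic with assumed figures (80 GB/s of DRAM bandwidth, 120 ns of access latency, 64-byte cache lines); the real values depend on the platform, but any realistic choice yields far more outstanding requests than the 10 LFBs a single core provides.

```c
#include <stdio.h>

int main(void) {
    /* Assumed figures for illustration only; real values vary by platform. */
    double bandwidth_gbs = 80.0;   /* peak DRAM bandwidth in GB/s            */
    double latency_ns    = 120.0;  /* off-chip DRAM access latency in ns     */
    double line_bytes    = 64.0;   /* cache-line (memory request) size       */
    double lfbs_per_core = 10.0;   /* line fill buffers per Haswell core     */

    /* Little's law: concurrency = throughput x latency.
     * GB/s x ns = bytes, so dividing by the line size gives requests.       */
    double in_flight = bandwidth_gbs * latency_ns / line_bytes;

    printf("outstanding 64-byte requests needed: %.0f\n", in_flight);
    printf("cores needed at %.0f LFBs each:      %.0f\n",
           lfbs_per_core, in_flight / lfbs_per_core);
    return 0;
}
```

With these assumptions, roughly 150 requests must be outstanding at once, i.e., about 15 cores at 10 LFBs each, which is consistent with the report in [2] that 10 or more cores were needed to saturate DRAM bandwidth.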