Machine-learning tasks are becoming pervasive in a broad range of domains, and in a broad range of systems (from embedded systems to data centers). Recent advances in machine learning show that neural networks are the state-of-the-art across many applications. As architectures evolve toward heterogeneous multi-cores composed of a mix of cores and accelerators, a neural-network accelerator can achieve the rare combination of efficiency (due to the small number of target algorithms) and broad application scope. Until now, most machine-learning accelerator designs have focused on efficiently implementing the computational part of the algorithms. However, recent state-of-the-art neural networks are characterized by their large size. In this study, we design an accelerator architecture for large-scale neural networks, with a special emphasis on the impact of memory on accelerator design, performance, and energy. We present a concrete design at 65nm which can perform 496 16-bit fixed-point operations in parallel every 1.02ns, i.e., 452 GOP/s, in a 3.02mm², 485mW footprint (excluding main memory accesses).