Recent success in deep neural networks has generated strong interest in hardware accelerators to improve speed and energy consumption. This paper presents a new type of photonic accelerator based on coherent detection that is scalable to large ($N \gtrsim 10^6$) networks and can be operated at high (GHz) speeds and very low (sub-aJ) energies per multiply-and-accumulate (MAC), using the massive spatial multiplexing enabled by standard free-space optical components. In contrast to previous approaches, both weights and inputs are optically encoded so that the network can be reprogrammed and trained on the fly. Simulations of the network using models for digit- and image-classification reveal a "standard quantum limit" for optical neural networks, set by photodetector shot noise. This bound, which can be as low as 50 zJ/MAC, suggests that performance below the thermodynamic (Landauer) limit for digital irreversible computation is theoretically possible in this device. The proposed accelerator can implement both fully-connected and convolutional networks. We also present a scheme for back-propagation and training that can be performed in the same hardware. This architecture will enable a new class of ultra-low-energy processors for deep learning.

In recent years, deep neural networks have tackled a wide range of problems including image analysis [1], natural language processing [2], game playing [3], physical chemistry [4], and medicine [5]. This is not a new field, however. The theoretical tools underpinning deep learning have been around for several decades [6,7,8]; the recent resurgence is driven primarily by (1) the availability of large training datasets [9] and (2) substantial growth in computing power [10] and the ability to train networks on GPUs [11]. Moving to more complex problems and higher network accuracies requires larger and deeper neural networks, which in turn require even more computing power [12]. This motivates the development of special-purpose hardware optimized to perform neural-network inference and training [13].

To outperform a GPU, a neural-network accelerator must significantly lower the energy consumption, since the performance of modern microprocessors is limited by on-chip power [14]. In addition, the system must be fast, programmable, scalable to many neurons, compact, and ideally compatible with training as well as inference. Application-specific integrated circuits (ASICs) are one obvious candidate for this task. State-of-the-art ASICs can reduce the energy per multiply-and-accumulate (MAC) from 20 pJ/MAC for modern GPUs [15] to around 1 pJ/MAC [16,17]. However, ASICs are based on CMOS technology and therefore suffer from the interconnect problem: even in highly optimized architectures where data is stored in register files close to the logic units, a majority of the energy consumption comes from data movement, not logic [13,16]. Analog crossbar arrays based on CMOS gates [18] or memristors [19,20] promise better performance, but as analog electronic devices, they suffer from calibration issues and li...
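
As a rough sanity check on the energy scales quoted above, the short sketch below converts the per-MAC energies mentioned in this section (20 pJ for GPUs, ~1 pJ for ASICs, and the 50 zJ quantum-limit figure) into equivalent photon counts and into multiples of the Landauer bound $kT\ln 2$. The 1.55 μm telecom wavelength and 300 K temperature are assumptions for illustration, not values stated in this excerpt.

```python
import math

# Physical constants (SI units)
h = 6.62607015e-34   # Planck constant, J*s
c = 2.99792458e8     # speed of light, m/s
kB = 1.380649e-23    # Boltzmann constant, J/K

# Assumed operating point (illustrative only, not specified in this excerpt)
wavelength = 1.55e-6  # m, telecom C-band
T = 300.0             # K

photon_energy = h * c / wavelength      # ~0.13 aJ per photon
landauer = kB * T * math.log(2)         # ~2.9 zJ per irreversible bit operation

# Energies per MAC quoted in the text: GPU, ASIC, and the 50 zJ/MAC quantum limit
for label, e_mac in [("GPU", 20e-12), ("ASIC", 1e-12), ("quantum limit", 50e-21)]:
    print(f"{label:14s}: {e_mac:.1e} J/MAC "
          f"~ {e_mac / photon_energy:.3g} photons/MAC "
          f"~ {e_mac / landauer:.3g} x kT*ln2")
```

At these assumed parameters, 50 zJ/MAC corresponds to well under one optical photon per MAC, while still being a few tens of times $kT\ln 2$. The two statements are compatible presumably because the Landauer bound applies per irreversible bit operation and a digital MAC comprises many such operations, which is how sub-photon, sub-aJ operation can sit below the limit for digital irreversible computation as claimed in the abstract.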