large and unavoidable time and energy penalties for data transport between memory and computational blocks, the so-called "von Neumann" bottleneck. Analog hardware accelerators for deep learning can avoid this bottleneck by performing multiply-accumulate (MAC) operations in memory using a crossbar array structure. [5][6][7] In these crossbar arrays, nonvolatile memory (NVM) elements are used to encode synaptic weights. This approach was recently shown to be capable of a 280× speedup in per-area throughput while also providing a 100× improvement in per-area energy efficiency over state-of-the-art GPUs. [8] While the benefits of speed are readily appreciated, improved energy efficiency can be a powerful driver of purchasing decisions in the data-center space as well. [9][10][11][12]

Crossbar arrays have been implemented using a variety of analog NVM elements, including resistive RAM (ReRAM), [13,14] conductive-bridging RAM (CBRAM), [15] flash, [16][17][18][19][20] and phase-change memory (PCM). [21,22] However, most NVM device candidates exhibit nonideal behavior to varying extents, including limited resistance contrast, significant nonlinearity in conductance change, and strong asymmetry in bidirectional programming. At the application level, these nonidealities translate into neural network accuracies that are significantly lower than those of software-based approaches. Many of them, however, can be addressed to an extent through device- and circuit-level engineering.

One key idea is to use multiple conductances of varying significance (Figure 1a), in which each synaptic weight is distributed across at least two conductance pairs according to W = F(G⁺ − G⁻) + (g⁺ − g⁻), where the factor F assigns greater significance to the (G⁺, G⁻) pair. For instance, during training, a PCM cell can be paired with a capacitive cell so that the complementary features of both are combined and the overall device requirements on the PCM are relaxed. [8] In such a scheme, the majority of DNN weight tuning occurs on a linear but volatile three-transistor, one-capacitor (3T1C) memory structure, with periodic but infrequent transfer to the PCM cell for nonvolatile storage. Row- (or column-) wise weight transfer also arises in ex situ training and other training variants. Since PCM is subject to the nonidealities mentioned above, this weight transfer is best performed with a closed-loop iterative tuning procedure consisting of multiple write-plus-read-verify steps to reach the overall target weight. To avoid performance degradation during training, and to reduce the downtime for inference when new parameter sets must be loaded, NVM programming should ideally be carried out simultaneously along an entire dimension (e.g., a row or column) of the crossbar array.
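The weight decomposition above can be made concrete with a short sketch. The following Python snippet is a minimal illustration, not the hardware implementation: it splits a target weight matrix into two differential conductance pairs of different significance and evaluates the MAC operation as the signed, gain-weighted sum of crossbar read currents. The significance factor F, the conductance range, and the array sizes are illustrative assumptions.

```python
# A minimal sketch (not the paper's implementation) of the multiple-conductance
# weight encoding W = F*(G+ - G-) + (g+ - g-) and of the crossbar MAC operation.
import numpy as np

F = 4.0        # assumed significance (gain) factor of the more significant pair
G_MAX = 1.0    # assumed maximum device conductance (normalized units)

def encode_weights(W):
    """Split each weight into a more significant pair (G+, G-) and a less
    significant pair (g+, g-), each clipped to the available device range."""
    msp = np.clip(W / F, -G_MAX, G_MAX)        # coarse part, carried with gain F
    lsp = np.clip(W - F * msp, -G_MAX, G_MAX)  # residue, carried at unit gain
    # Positive and negative contributions live on separate devices.
    return (np.maximum(msp, 0), np.maximum(-msp, 0),
            np.maximum(lsp, 0), np.maximum(-lsp, 0))

def crossbar_mac(x, G_pos, G_neg, g_pos, g_neg):
    """Ideal analog MAC: column currents are conductance-weighted sums of the
    applied read voltages, combined with the appropriate signs and gain."""
    return F * (x @ G_pos - x @ G_neg) + (x @ g_pos - x @ g_neg)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))        # target weight matrix
x = rng.normal(size=(1, 8))        # input activations applied as read voltages

print(np.allclose(x @ W, crossbar_mac(x, *encode_weights(W))))  # expect True
```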
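The closed-loop, row-wise transfer can be sketched in the same spirit. The toy device model below, with saturating potentiation, abrupt depression, and programming noise, is an assumption chosen only to illustrate the iterative write-plus-read-verify procedure; the tolerance, pulse sizes, and step budget are likewise hypothetical rather than measured PCM behavior.

```python
# A minimal sketch of closed-loop, row-wise weight programming with
# write-plus-read-verify steps, using a hypothetical nonideal device model.
import numpy as np

rng = np.random.default_rng(1)
G_MAX = 1.0

def apply_pulses(g, direction):
    """Apply one programming pulse to every selected device in the row.
    direction: +1 potentiate, -1 depress, 0 leave untouched (already verified)."""
    dg = np.where(direction > 0, 0.1 * (1.0 - g / G_MAX), -0.25)  # nonlinear / asymmetric
    dg = np.where(direction == 0, 0.0, dg)
    noise = rng.normal(scale=0.01, size=g.shape) * (direction != 0)
    return np.clip(g + dg + noise, 0.0, G_MAX)

def program_row(g_target, tol=0.02, max_steps=20):
    """Program all devices of one row simultaneously: read the row, compare
    against the targets, and pulse only the devices still outside tolerance."""
    g = np.zeros_like(g_target)
    for step in range(max_steps):
        error = g_target - g                 # read-verify for the whole row
        active = np.abs(error) > tol         # devices still needing correction
        if not active.any():
            return g, step
        g = apply_pulses(g, np.sign(error) * active)
    return g, max_steps

targets = np.array([0.20, 0.45, 0.60, 0.85])
g_final, n_steps = program_row(targets)
print(np.round(g_final, 3), "after", n_steps, "write-plus-read-verify iterations")
```

In a sketch like this, the number of verify iterations trades programming accuracy against transfer time, which is why parallel, row- or column-wise programming is emphasized above.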