In-Memory Computing (IMC) has emerged as a promising paradigm for energy-, throughput-, and area-efficient machine learning at the edge. However, the differences in hardware architectures, array dimensions, and fabrication technologies among published IMC realizations have made it difficult to grasp their relative strengths. Moreover, previous studies have primarily focused on exploring and benchmarking the peak performance of a single IMC macro rather than full-system performance on real workloads. This paper addresses the lack of a quantitative comparison of Analog In-Memory Computing (AIMC) and Digital In-Memory Computing (DIMC) processor architectures. We propose an analytical IMC performance model that is validated against published implementations and integrated into a system-level exploration framework for comprehensive performance assessments of different workloads with varying IMC configurations. Our experiments show that while DIMC generally has higher computational density than AIMC, AIMC with large macro sizes may achieve better energy efficiency than DIMC on convolutional and pointwise layers, which can exploit high spatial unrolling. Conversely, DIMC with small macro sizes outperforms AIMC on depthwise layers, which offer limited spatial unrolling opportunities inside a macro.

Index Terms—Machine learning, quantitative modeling, analog in-memory computing, digital in-memory computing
I. INTRODUCTION

Recent developments in ultra-low-power machine learning models have enabled the deployment of artificial intelligence on extreme edge devices. However, typical embedded digital accelerators suffer from high data-movement costs and low computational densities, degrading energy efficiency by up to 2×-1000× with respect to the ideal baseline of digital computation. To minimize this data-transfer overhead, in-memory computing (IMC) has recently emerged as a promising alternative to conventional accelerators based on arrays of digital processing elements (PEs). By performing the operations directly near/in the memory cells, these architectures greatly reduce access overheads and enable massive parallelization, with potential orders-of-magnitude improvements in energy efficiency and throughput [1]. Most of the initial IMC designs published in the literature focused on analog IMC (AIMC), where the computation is carried out in the analog domain [2]-[4]. While this approach offers extreme energy efficiency and massive parallelization, the analog nature of the computation and the presence