Training convolutional neural networks (CNNs) requires intense compute throughput and high memory bandwidth. In particular, convolution layers account for the majority of CNN training time, and GPUs are commonly used to accelerate these workloads. Optimizing GPU designs for efficient CNN training acceleration requires accurate modeling of how performance improves as compute and memory resources are scaled. We present DeLTA, the first analytical model that accurately estimates the traffic at each level of the GPU memory hierarchy while accounting for the complex data-reuse patterns of a parallel convolution algorithm. We demonstrate that our model is both accurate and robust across different CNNs and GPU architectures. We then show how the model can be used to carefully balance the scaling of different GPU resources for efficient CNN performance improvement.

Index Terms-GPU, memory system, deep learning, CNN

• We introduce DeLTA, a GPU performance model for CNNs. Unlike prior work, DeLTA accurately models traffic across all memory hierarchy levels, capturing the data reuse at each level; accurately modeling memory traffic is critical for future GPU designs, where compute throughput and memory bandwidth must be balanced.
• We are the first to analyze and model the memory access pattern of the im2col convolution algorithm, which is the most-