The recent phenomenal interest in convolutional neural networks (CNNs) has made it all but inevitable for the super-resolution (SR) community to explore their potential. The response has been immense and, in the three years since the advent of the pioneering work, so many studies have appeared that a comprehensive survey is warranted. This paper surveys the SR literature in the context of deep learning. We focus on three important aspects of multimedia, namely image, video, and multi-dimensional data, especially depth maps. In each case, the relevant benchmarks are first introduced in the form of datasets and state-of-the-art SR methods, excluding deep learning. Next comes a detailed analysis of the individual works, each comprising a short description of the method and a critique of the results, with special reference to the benchmarks introduced. This is followed by a minimal overall benchmarking in the form of comparisons on common datasets, relying on the results reported in the various works.

Fig. 1: Backpropagation (after [7]).

Training is carried out, via backpropagation (Fig. 1), until an acceptable level of convergence, whereby the optimized parameters should ideally classify each subsequent test case correctly. The birth of convolutional neural networks (CNNs), or ConvNets, can be traced back to 1988 [15], when backpropagation was employed to train a neural network to classify handwritten digits. Subsequent works by LeCun evolved into what later became known as LeNet5 [17]. After that there was a virtual lull until the late noughties [18], when GPUs had become efficient enough to enable the work in [19]. Since then the floodgates have opened and various architectures have appeared, such as AlexNet [20], ZFNet [21], GoogLeNet [22], and DenseNet [23]; for a detailed overview one can consult [18], [24].

The metamorphosis from a fully connected NN to a locally connected NN to a CNN is illustrated in Fig. 2. As can be seen, rather than being fully connected, the CNN employs convolutions that lead to local connections, where each local region of the input is connected to a neuron in the output. The input to a CNN is in the form of multiple arrays, such as a color image with three 2D arrays (length × width) corresponding to the RGB or YCbCr channels. The number of channels is called the depth and constitutes the third dimension; note that more than three channels are not uncommon, e.g. with hyperspectral images. A CNN is made up of layers, with each layer transforming an input 3D volume into an output 3D volume [11], typically via four distinct operations [25], viz. convolution, a non-linear activation function (ReLU), pooling or sub-sampling, and classification (fully connected layer). A simplified CNN is illustrated in Fig. 3. A CNN can thus be described as several convolution layers with non-linear activation functions (e.g. ReLU or sigmoid) applied to each layer. Each convolution layer applies several (possibly thousands of) distinct filters, whose responses are known as feature maps, and combines their results. These filters are learnt automatically during training, based on the task at hand; if the task is image classification, for example, the learning concerns …
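To make the above pipeline concrete, the following is a minimal sketch, assuming PyTorch, of a CNN built from the four operations just listed (convolution, ReLU, pooling, fully connected classification) and trained by backpropagation on a toy batch. The layer widths, input size, and hyperparameters are illustrative choices only and are not taken from any of the surveyed works.

```python
# Minimal sketch of the four canonical CNN operations described above:
# convolution, non-linear activation (ReLU), pooling/sub-sampling, and a
# fully connected classification layer. Assumes PyTorch; sizes are illustrative.
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # 3 input channels (e.g. RGB), 16 learnt filters
            nn.ReLU(),                                   # non-linear activation
            nn.MaxPool2d(2),                             # pooling / sub-sampling
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Fully connected layer maps the final 3D volume to class scores
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x):                    # x: (batch, 3, 32, 32)
        x = self.features(x)                 # -> (batch, 32, 8, 8)
        return self.classifier(x.flatten(1))

# The filters are learnt automatically by backpropagation (cf. Fig. 1),
# here for a dummy classification task.
model = SimpleCNN()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

images = torch.randn(4, 3, 32, 32)           # dummy input batch
labels = torch.randint(0, 10, (4,))          # dummy class labels
loss = criterion(model(images), labels)
optimizer.zero_grad()
loss.backward()                              # backpropagation of the error
optimizer.step()                             # parameter update
```

In this sketch each `Conv2d` layer holds the learnable filters, the `ReLU` and `MaxPool2d` stages supply the non-linearity and sub-sampling, and the final `Linear` layer plays the role of the fully connected classification stage; an SR network would instead end in further convolutions that reconstruct a high-resolution output rather than class scores.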