Extracting useful features from a scene is an essential step in any computer vision and multimedia data analysis task. Though progress has been made in past decades, it is still quite difficult for computers to comprehensively and accurately recognize an object or pinpoint the more complicated semantics of an image or a video. Thus, feature extraction is expected to remain an active research area in advancing computer vision and multimedia data analysis for the foreseeable future.

Approaches to feature extraction can be divided into two categories: model-centric and data-driven. The model-centric approach relies on human heuristics to develop a computer model (or algorithm) to extract features from an image. (We use imagery data as our example throughout this chapter.) Some widely used models are the Gabor filter, wavelets, and SIFT [42]. These models were engineered by scientists and then validated via empirical studies. A major shortcoming of the model-centric approach is that unusual circumstances a model does not take into consideration during its design, such as different lighting conditions and unexpected environmental factors, can render the engineered features less effective. In contrast to the model-centric approach, which dictates representations independent of data, the data-driven approach learns representations from data [10]. Example data-driven algorithms are the multilayer perceptron (MLP) and the convolutional neural network (CNN), which belong to the general category of neural networks and deep learning [27,29].

Both model-centric and data-driven approaches employ a model (an algorithm or machine). The differences between them can be characterized by two related questions:

• Can data affect model parameters? With a model-centric approach, training data does not affect the model. With a data-driven approach such as MLP or CNN, internal parameters are changed/learned based on the structure discovered in large data sets [38].

• Can data affect representations? Whereas more data can help a data-driven approach improve its representations, more data cannot change the features extracted by a model-centric approach. For example, in a CNN the features of an image can be affected by the other images (because the network parameters, modified through backpropagation, are shaped by all training images), but the feature set of an image is invariant to the other images in a model-centric pipeline such as SIFT. (The two sketches at the end of this section illustrate this contrast in code.)

The greater the quantity and diversity of the data, the better the representations a data-driven pipeline can learn. In other words, if a learning algorithm has seen enough training instances of an object under various conditions, e.g., in different postures and under partial occlusion, then the features learned from the training data will be more comprehensive. The focus of this chapter is on how neural networks, specifically the convolutional neural network (CNN), achieve effective representation learning. The neural network, a neuroscience-motivated model, is based on Hubel and Wiesel's research on the cat's visual cortex.
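To make the model-centric side of the contrast concrete, the following is a minimal sketch of a fixed-feature pipeline using OpenCV's SIFT implementation; the opencv-python package and the file name image.jpg are illustrative assumptions, not part of this chapter's text.

```python
# A minimal sketch of a model-centric pipeline: SIFT keypoint features.
# Assumes the opencv-python (cv2) package and a local file "image.jpg";
# both are illustrative placeholders.
import cv2

img = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()

# detectAndCompute returns keypoints and one 128-dimensional descriptor
# per keypoint. The descriptors depend only on this image and the fixed
# SIFT model: no other image in a collection can change them.
keypoints, descriptors = sift.detectAndCompute(img, None)
print(f"{len(keypoints)} keypoints, descriptor shape {descriptors.shape}")
```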
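On the data-driven side, the following sketch performs one backpropagation step on a small CNN. PyTorch is an assumed framework choice, and the random batch stands in for real training images; the layer sizes are arbitrary. The point is that every training image modifies the shared parameters, and hence the features the network later extracts from any single image.

```python
# A minimal sketch of a data-driven pipeline: a small CNN whose
# parameters are updated by backpropagation (PyTorch assumed).
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),  # learned filters
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(8 * 14 * 14, 10),                 # learned classifier
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# One gradient step on a random batch (a stand-in for real 28x28 images):
# the batch as a whole shapes the shared filter weights, so the features
# later computed for any one image depend on the whole training set --
# the opposite of the fixed SIFT pipeline above.
images = torch.randn(32, 1, 28, 28)
labels = torch.randint(0, 10, (32,))
loss = loss_fn(model(images), labels)
loss.backward()    # backpropagation computes gradients
optimizer.step()   # parameters move toward better representations
```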