Neurons in the mammalian primary visual cortex are known to possess spatially localized, oriented receptive fields. It has previously been suggested that these distinctive properties may reflect an efficient image-encoding strategy based on maximizing the sparseness of the distribution of output neuronal activities or, alternatively, on extracting the independent components of natural image ensembles. Here, we show that a strategy for transformation-invariant coding of images based on a first-order Taylor series expansion of an image also causes localized, oriented receptive fields to be learned from natural image inputs. These receptive fields, which approximate localized first-order differential operators at various orientations, allow a pair of cooperating neural networks, one estimating object identity ('what') and the other estimating object transformations ('where'), to simultaneously recognize an object and estimate its pose by jointly maximizing the a posteriori probability of generating the observed visual data. We provide experimental results demonstrating the ability of such networks to factor retinal stimuli into object-centred features and object-invariant transformation estimates.
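The core idea behind the transformation estimate can be illustrated with a minimal sketch (hypothetical code, not the paper's implementation): a small translation of a one-dimensional signal is approximated to first order as I(x + s) ≈ I(x) + s dI/dx, and the transformation parameter s (the 'where' quantity) is recovered by least squares from the signal's gradient, the one-dimensional analogue of the localized first-order differential operators described above.

```python
import numpy as np

# Smooth 1-D signal standing in for a row of a natural image.
x = np.linspace(0, 2 * np.pi, 256)
image = np.sin(x) + 0.5 * np.sin(3 * x)

# The same signal translated by a small amount s_true.
s_true = 0.05
shifted = np.sin(x + s_true) + 0.5 * np.sin(3 * (x + s_true))

# Localized first-order differential operator (numerical gradient dI/dx).
grad = np.gradient(image, x)

# First-order Taylor expansion:  shifted - image ≈ s * grad,
# so the least-squares estimate of the translation is
#   s ≈ <grad, shifted - image> / <grad, grad>.
diff = shifted - image
s_est = (grad @ diff) / (grad @ grad)

print(s_est)  # close to s_true for small translations
```

Because the expansion is first order, the estimate is accurate only for small transformations; larger shifts would require iterating the estimate or higher-order terms.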