Over successive stages, the ventral visual system develops neurons that respond with view, size and position invariance to objects, including faces. A major challenge is to explain how invariant representations of individual objects could develop given visual input from environments containing multiple objects. Here we show that neurons in a one-layer competitive network learn to represent combinations of three objects simultaneously present during training if the number of objects in the training set is small (e.g. 4), combinations of two objects as the number of objects is increased (e.g. to 10), and individual objects as the number of objects in the training set is increased further (e.g. to 20). We next show that translation invariant representations can be formed even when multiple stimuli are always present during training, by including a temporal trace in the learning rule. Finally, we show that these concepts can be extended to a multi-layer hierarchical network model (VisNet) of the ventral visual system. This approach provides a way to understand how a visual system can, by self-organizing competitive learning, form separate invariant representations of each object even when each object is presented in a scene with multiple other objects present, as in natural visual scenes.
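As an illustration of the kind of mechanism described above, the following is a minimal sketch (not the authors' implementation) of a one-layer competitive network trained on superpositions of several objects; the layer sizes, learning rate, sparseness and object counts are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of a one-layer competitive network trained on input patterns
# that each contain several simultaneously present objects.  All names and
# parameter values here are illustrative assumptions, not the paper's.

rng = np.random.default_rng(0)

n_inputs = 100        # dimensionality of the input (e.g. a retinal feature vector)
n_outputs = 25        # neurons in the competitive layer
alpha = 0.1           # learning rate

# Random initial weights, normalised per output neuron as in standard
# competitive learning.
W = rng.random((n_outputs, n_inputs))
W /= np.linalg.norm(W, axis=1, keepdims=True)

def train_step(x):
    """One winner-take-all competitive learning step on input vector x."""
    y = W @ x                      # feedforward activation
    winner = np.argmax(y)          # hard competition: single winner
    # Hebbian update for the winner, followed by renormalisation so that
    # weight vectors move towards the (multi-object) input patterns.
    W[winner] += alpha * x
    W[winner] /= np.linalg.norm(W[winner])

# Illustrative training set: each pattern is the superposition of k objects
# drawn from a pool of n_objects sparse binary object vectors.  With a small
# pool, winners come to represent object combinations; with a larger pool,
# individual objects come to dominate the learned weight vectors.
n_objects, k = 10, 2
objects = (rng.random((n_objects, n_inputs)) < 0.1).astype(float)
for _ in range(1000):
    idx = rng.choice(n_objects, size=k, replace=False)
    train_step(np.clip(objects[idx].sum(axis=0), 0, 1))
```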
Experimental studies have provided evidence that the visual processing areas of the primate brain represent facial identity and facial expression within different subpopulations of neurons. For example, in non-human primates there is evidence that cells within the inferior temporal gyrus (TE) respond primarily to facial identity, while cells within the superior temporal sulcus (STS) respond to facial expression. More recently, it has been found that the orbitofrontal cortex (OFC) of non-human primates contains some cells that respond exclusively to changes in facial identity, while other cells respond exclusively to facial expression. How might the primate visual system develop physically separate representations of facial identity and expression, given that it is always exposed to simultaneous combinations of identity and expression during learning? In this paper, a biologically plausible neural network model of the ventral visual pathway, VisNet, is trained on a set of carefully designed cartoon faces with different identities and expressions. The VisNet model architecture is composed of a hierarchical series of four Self-Organising Maps (SOMs), with associative learning in the feedforward synaptic connections between successive layers. During learning, the network develops separate clusters of cells that respond exclusively to either facial identity or facial expression. We interpret the performance of the network in terms of the learning properties of SOMs, which are able to exploit the statistical independence between facial identity and expression.
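For illustration, a single Kohonen-style SOM layer of the kind stacked into such a hierarchy could be sketched as follows; the map size, learning rate and neighbourhood width are assumptions for the example rather than the paper's parameters.

```python
import numpy as np

# Minimal sketch of one Self-Organising Map (SOM) layer of the kind the
# abstract describes being stacked into a four-layer hierarchy.  Sizes and
# parameters are illustrative assumptions.

rng = np.random.default_rng(1)

map_h, map_w = 16, 16             # 2-D sheet of output neurons
n_inputs = 256                    # input dimensionality (e.g. the previous layer)
W = rng.random((map_h, map_w, n_inputs))

# Grid coordinates of every unit, used for the neighbourhood function.
ys, xs = np.meshgrid(np.arange(map_h), np.arange(map_w), indexing="ij")

def som_step(W, x, lr=0.05, sigma=2.0):
    """One Kohonen-style update: find the best-matching unit and pull it and
    its topographic neighbours towards the input pattern x (in place)."""
    dists = np.linalg.norm(W - x, axis=2)             # distance of every unit to x
    by, bx = np.unravel_index(np.argmin(dists), dists.shape)
    # Gaussian neighbourhood centred on the best-matching unit: nearby units
    # on the map sheet learn strongly, distant units hardly at all.  This
    # topographic competition is what allows statistically independent factors
    # (e.g. identity vs. expression) to claim separate clusters of cells.
    h = np.exp(-((ys - by) ** 2 + (xs - bx) ** 2) / (2 * sigma ** 2))
    W += lr * h[..., None] * (x - W)

# Example: one update on a random input pattern.
som_step(W, rng.random(n_inputs))
```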
We show how hand-centred visual representations could develop in the primate posterior parietal and premotor cortices during visually guided learning in a self-organizing neural network model. The model incorporates trace learning in the feed-forward synaptic connections between successive neuronal layers. Trace learning encourages neurons to learn to respond to input images that tend to occur close together in time. We assume that sequences of eye movements are performed around individual scenes containing a fixed hand-object configuration. Trace learning will then encourage individual cells to learn to respond to particular hand-object configurations across different retinal locations. The plausibility of this hypothesis is demonstrated in computer simulations.
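A minimal sketch of a trace-learning update of this kind, assuming an exponentially decaying trace of postsynaptic activity and illustrative layer sizes, is given below; it is not the model's actual code.

```python
import numpy as np

# Minimal sketch of a trace learning rule applied to a sequence of retinal
# views of one fixed hand-object configuration, as produced by successive eye
# movements around a scene.  Inputs, sizes and parameters are hypothetical.

rng = np.random.default_rng(2)

n_inputs, n_outputs = 128, 32
alpha, eta = 0.05, 0.8            # learning rate and trace decay parameter
W = rng.random((n_outputs, n_inputs))
W /= np.linalg.norm(W, axis=1, keepdims=True)

def train_on_sequence(W, views):
    """Apply the trace rule  dW = alpha * ybar * x  over one fixation
    sequence, where ybar is an exponentially decaying trace of postsynaptic
    activity:  ybar_t = (1 - eta) * y_t + eta * ybar_{t-1}  (W updated in place)."""
    ybar = np.zeros(W.shape[0])
    for x in views:                              # one retinal view per fixation
        y = np.maximum(W @ x, 0.0)               # simple rectified activation
        ybar = (1.0 - eta) * y + eta * ybar      # temporal trace of activity
        W += alpha * np.outer(ybar, x)           # trace-modulated Hebbian update
        W /= np.linalg.norm(W, axis=1, keepdims=True)

# Example: the same random "configuration" pattern shifted to different
# retinal positions within one fixation sequence, so the same cells are
# encouraged to respond across locations.
pattern = (rng.random(n_inputs) < 0.1).astype(float)
views = [np.roll(pattern, shift) for shift in (0, 4, 8, 12)]
train_on_sequence(W, views)
```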
We show in a unifying computational approach that representations of spatial scenes can be formed by adding an additional self-organizing layer of processing beyond the inferior temporal visual cortex in the ventral visual stream, without the introduction of new computational principles. The invariant representations of objects by neurons in the inferior temporal visual cortex can be modelled by a multilayer feature hierarchy network with feedforward convergence from stage to stage, and an associative learning rule with a short-term memory trace to capture the invariant statistical properties of objects as they transform over short time periods in the world. If an additional layer is added to this architecture and is trained with whole scenes, each consisting of a set of objects in a fixed spatial relation to each other, neurons in the added layer come to respond to one of the trained whole scenes but do not respond if the objects in the scene are rearranged to form a new scene from the same objects. The formation of these scene-specific representations in the added layer is related to the fact that, in the inferior temporal cortex and, as we show, in the VisNet model, the receptive fields of neurons shrink and become asymmetric when multiple objects are simultaneously present in a natural scene. This reduced size and asymmetry of the receptive fields of inferior temporal cortex neurons also provides a solution to the representation of multiple objects, and their relative spatial positions, in complex natural scenes.
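To illustrate the idea of the added layer, the following sketch trains a competitive layer on whole-scene vectors built from a hypothetical object-at-position code, standing in for the position-dependent inferior temporal representation described above; all names, sizes and encodings are assumptions for the example.

```python
import numpy as np

# Minimal sketch (illustrative assumptions throughout) of adding one extra
# competitive layer on top of an object-level layer.  Each "scene" is a fixed
# arrangement of objects, encoded as a one-hot object identity per spatial
# position.

rng = np.random.default_rng(3)

n_positions, n_object_types = 4, 8
n_inputs = n_positions * n_object_types       # object-at-position code
n_scene_cells = 16
alpha = 0.1

W_scene = rng.random((n_scene_cells, n_inputs))
W_scene /= np.linalg.norm(W_scene, axis=1, keepdims=True)

def scene_vector(arrangement):
    """Encode a scene: `arrangement` is a tuple like (2, 5, 0, 7) giving the
    object index present at each spatial position."""
    v = np.zeros((n_positions, n_object_types))
    v[np.arange(n_positions), arrangement] = 1.0
    return v.ravel()

def train_scene(W, x):
    """Competitive (winner-take-all) associative learning on a whole scene."""
    winner = np.argmax(W @ x)
    W[winner] += alpha * x
    W[winner] /= np.linalg.norm(W[winner])

# Train repeatedly on a few fixed arrangements.  Rearranging the same objects
# yields a different scene vector, which the trained scene cells do not match.
trained_scenes = [(2, 5, 0, 7), (1, 3, 6, 4)]
for _ in range(500):
    train_scene(W_scene, scene_vector(trained_scenes[rng.integers(2)]))
```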