Human vision is still largely unexplained. Computer vision has made impressive progress on this front, but it remains unclear to what extent artificial neural networks approximate human brain strategies. Here, we confirm this gap by testing how biological and artificial systems encode object-scene contextual regularities in natural images. Both systems represent these regularities, but the underlying information processing is markedly different. In human vision, objects and backgrounds are represented separately, with rich domain-specific representations characterizing human visual cortex; interaction between these components occurs downstream, in frontoparietal areas. Conversely, neural networks represent image components in a single entangled representation, revealing reduced object-segregation abilities and impoverished domain-specific object spaces. These results highlight what is unique about human vision, namely the understanding that images are not just a collection of features, and point to the need for neural network models with a similar richness of representational content.
Human vision is still largely unexplained. Computer vision has made impressive progress on this front, but it is still unclear to what extent artificial neural networks approximate human object vision at the behavioral and neural levels. Here, we investigated whether machine object vision mimics the representational hierarchy of human object vision, using an experimental design that allows testing within-domain representations for animals and scenes as well as across-domain representations reflecting their real-world contextual regularities, such as animal-scene pairs that often co-occur in the visual environment. We found that DCNNs trained for object recognition acquire representations, in their late processing stage, that closely capture human conceptual judgements about the co-occurrence of animals and their typical scenes. Likewise, the DCNNs' representational hierarchy shows surprising similarities with the representational transformations emerging in domain-specific ventrotemporal areas up to domain-general frontoparietal areas. Despite these remarkable similarities, the underlying information processing differs. The ability of neural networks to learn a human-like, high-level conceptual representation of object-scene co-occurrence depends on the amount of object-scene co-occurrence present in the image set, highlighting the fundamental role of training history. Further, although mid- and high-level DCNN layers represent the category division between animals and scenes, as observed in ventral temporal cortex (VTC), their information content shows reduced domain-specific representational richness. To conclude, by testing within- and between-domain selectivity while manipulating contextual regularities, we reveal previously unknown similarities and differences in the information processing strategies employed by human and artificial visual systems.
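Comparisons like the one described above, between DCNN layer representations, human conceptual judgements, and cortical responses, are commonly carried out with representational similarity analysis (RSA). The abstract does not spell out the procedure, so the following is only a minimal sketch under that assumption; `layer_activations` and `human_dissimilarity` are hypothetical placeholder data standing in for one DCNN layer's responses and a matching matrix of human dissimilarity judgements over the same stimuli.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(features: np.ndarray) -> np.ndarray:
    """Representational dissimilarity matrix in condensed form:
    pairwise correlation distance between stimulus feature vectors."""
    return pdist(features, metric="correlation")

# Hypothetical inputs: one feature vector per image for a given DCNN layer,
# and a symmetric matrix of human dissimilarity ratings for the same images.
rng = np.random.default_rng(0)
layer_activations = rng.normal(size=(40, 512))      # 40 stimuli x 512 units
human_dissimilarity = rng.normal(size=(40, 40))
human_dissimilarity = (human_dissimilarity + human_dissimilarity.T) / 2

# Compare the two representational geometries: a high rank correlation
# means the layer orders stimulus pairs the way human judgements do.
upper = np.triu_indices(40, k=1)
rho, p = spearmanr(rdm(layer_activations), human_dissimilarity[upper])
print(f"layer-vs-behaviour RSA: rho={rho:.3f}, p={p:.3g}")
```

Repeating this correlation across layers (or across brain regions) is what yields the layer-by-layer hierarchy comparisons the abstract refers to; with real stimuli, the rank correlation would be computed against measured judgements or fMRI-derived dissimilarities rather than random placeholders.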