Understanding natural scenes involves the contribution of bottom-up analysis and top-down modulatory processes. However, the interaction of these processes during the categorization of natural scenes is not well understood. In the current study, we approached this issue using ERPs together with behavioral and computational data. We presented pictures of natural scenes and asked participants to categorize them in response to different questions (Is it an animal/vehicle? Is it indoors/outdoors? Are there one/two foreground elements?). ERPs for target scenes requiring a "yes" response began to differ from those for nontarget scenes at 250 msec after picture onset, and this ERP difference was not modulated by the categorization question. Earlier ERPs showed category-specific differences (e.g., between animals and vehicles), which were associated with the processing of scene statistics. From 180 msec after scene onset, these category-specific ERP differences were modulated by the categorization question that was asked. Thus, categorization goals modulate not only the later stages associated with the target/nontarget decision but also earlier perceptual stages, which are involved in the processing of scene statistics.
The investigation of visual categorization has recently been aided by the introduction of deep convolutional neural networks (CNNs), which achieve unprecedented accuracy in picture classification after extensive training. Although the architecture of CNNs is inspired by the organization of the visual brain, the similarity between CNN and human visual processing remains unclear. Here, we investigated this issue by engaging humans and CNNs in a two‐class visual categorization task. To this end, pictures containing animals or vehicles were modified to contain only low or only high spatial frequency (HSF) information, or were scrambled in the phase of the spatial frequency spectrum. For all types of degradation, accuracy increased as degradation was reduced for both humans and CNNs; however, the thresholds for accurate categorization differed between humans and CNNs. Larger differences were observed for HSF information than for the other two types of degradation, both in overall accuracy and in image‐level agreement between humans and CNNs. The difficulty that CNNs showed in categorizing high‐passed natural scenes was reduced by picture whitening, a procedure inspired by how visual systems process natural images. The results are discussed in terms of adaptation to regularities in the visual environment (scene statistics); if the visual characteristics of the environment are not learned by CNNs, their visual categorization may depend only on a subset of the visual information on which humans rely, for example, on low spatial frequency information.
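The degradations named above (low-pass and high-pass filtering, phase scrambling) and the whitening step can be illustrated with a minimal NumPy sketch. This is not the authors' stimulus pipeline: the function names, the hard (ideal) frequency cutoff, the additive random-phase scrambling scheme, and the flat-amplitude definition of whitening are all assumptions chosen for illustration, and the abstract does not specify the actual parameters.

```python
import numpy as np


def radial_frequency(shape):
    """Radial spatial frequency (cycles per pixel) of every FFT coefficient."""
    fy = np.fft.fftfreq(shape[0])[:, None]
    fx = np.fft.fftfreq(shape[1])[None, :]
    return np.sqrt(fx ** 2 + fy ** 2)


def frequency_filter(image, cutoff, keep="low"):
    """Ideal low-pass or high-pass filter (assumed hard cutoff, cycles per pixel)."""
    radius = radial_frequency(image.shape)
    mask = radius <= cutoff if keep == "low" else radius > cutoff
    return np.real(np.fft.ifft2(np.fft.fft2(image) * mask))


def phase_scramble(image, amount, rng):
    """Add a fraction `amount` (0 to 1) of random phase, keeping the amplitude spectrum."""
    spectrum = np.fft.fft2(image)
    # The phase of a real-valued noise image is antisymmetric, so for
    # amount = 1 the scrambled spectrum still corresponds to a real image.
    noise_phase = np.angle(np.fft.fft2(rng.standard_normal(image.shape)))
    phase = np.angle(spectrum) + amount * noise_phase
    return np.real(np.fft.ifft2(np.abs(spectrum) * np.exp(1j * phase)))


def whiten(image, eps=1e-12):
    """Flatten the amplitude spectrum while keeping the phase (one assumed form of whitening)."""
    spectrum = np.fft.fft2(image)
    return np.real(np.fft.ifft2(spectrum / (np.abs(spectrum) + eps)))


# Hypothetical usage on a random grayscale array standing in for a scene picture.
rng = np.random.default_rng(0)
image = rng.random((32, 32))
low = frequency_filter(image, cutoff=0.1, keep="low")
high = frequency_filter(image, cutoff=0.1, keep="high")
scrambled = phase_scramble(image, amount=1.0, rng=rng)
flat = whiten(image)
```

With a hard cutoff, the low-pass and high-pass versions sum back to the original picture, which makes the two conditions complementary. Published stimulus sets often use smooth (for example, Gaussian or Butterworth) filters instead, to avoid ringing artifacts.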