We develop a visuomotor model that implements visual search as a focal accuracy-seeking policy, with the target's position and category drawn independently from a common generative process. Consistent with the anatomical separation between the ventral and dorsal pathways, the model is composed of two pathways that respectively infer what to see and where to look. The "What" network is a classical deep learning classifier that only processes a small region around the center of fixation, providing a "foveal" accuracy. In contrast, the "Where" network processes the full visual field in a biomimetic fashion, using a log-polar retinotopic encoding that is preserved up to the action selection level. In our model, the foveal accuracy is used as a monitoring signal to train the "Where" network, much as in the actor/critic framework. After training, the "Where" network provides an "accuracy map" that serves to guide the eye toward peripheral objects. Finally, comparing the two networks' accuracies amounts to either selecting a saccade or keeping the eye fixated at the center to identify the target. We test this setup on a simple task: finding a digit in a large, cluttered image. Our simulation results demonstrate the effectiveness of this approach, increasing by one order of magnitude the radius of the visual field over which the agent can detect and recognize a target, whether with a single saccade or with several. Importantly, our log-polar treatment of the visual information exploits the strong compression performed at the sensory level, providing a way to implement visual search in a sublinear fashion, in contrast with mainstream computer vision.
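To make the saccade-or-classify decision rule concrete, here is a minimal sketch, assuming PyTorch; the class names (WhatNet, WhereNet), the layer sizes, the 28x28 foveal patch, and the 10x24 log-polar grid are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

N_CLASSES = 10              # digit categories (MNIST-like targets, an assumption)
N_ECC, N_AZIMUTH = 10, 24   # assumed log-polar grid resolution

class WhatNet(nn.Module):
    """Classifies the small foveal patch around the center of fixation."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 128), nn.ReLU(),
            nn.Linear(128, N_CLASSES))

    def forward(self, fovea):               # fovea: (1, 1, 28, 28)
        return self.net(fovea).softmax(-1)  # class probabilities

class WhereNet(nn.Module):
    """Maps the log-polar full-field input to an "accuracy map": the
    predicted foveal accuracy after a saccade to each retinotopic cell."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(N_ECC * N_AZIMUTH, 256), nn.ReLU(),
            nn.Linear(256, N_ECC * N_AZIMUTH), nn.Sigmoid())

    def forward(self, logpolar):            # logpolar: (1, N_ECC, N_AZIMUTH)
        return self.net(logpolar).view(-1, N_ECC, N_AZIMUTH)

def act(what, where, fovea, logpolar):
    """Saccade-or-classify rule: move the eye if some peripheral location
    promises a higher accuracy than the current fixation, else identify."""
    probs = what(fovea)[0]
    central_acc = probs.max().item()        # confidence at the fovea
    acc_map = where(logpolar)[0]
    peripheral_acc = acc_map.max().item()
    if peripheral_acc > central_acc:
        flat = acc_map.argmax().item()
        ecc, az = divmod(flat, N_AZIMUTH)   # retinotopic saccade target
        return ("saccade", (ecc, az))
    return ("classify", probs.argmax().item())

# During training (not shown), the Where network would be regressed onto the
# accuracy actually obtained by the What network after each saccade, the
# foveal accuracy acting as the critic-like monitoring signal.
what, where = WhatNet(), WhereNet()
action = act(what, where,
             torch.rand(1, 1, 28, 28), torch.rand(1, N_ECC, N_AZIMUTH))
print(action)
```

Note that under this rule the two pathways only need to exchange scalar accuracies, which is what lets the "Where" network be trained from the "What" network's output without sharing representations.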
Author summary

The visual search task consists in extracting scarce and specific visual information (the "target") from a large and cluttered visual display. In computer vision, this task is usually implemented by scanning, in parallel, all possible target identities at all possible spatial positions, hence with a heavy computational load. The human visual system employs a different strategy, combining a foveated sensor with the capacity to rapidly move the center of fixation using saccades. Visual processing is then separated into two specialized pathways: the "where" pathway, which mainly conveys information about the target's position in peripheral space (independently of its category), and the "what" pathway, which mainly conveys information about the target's category (independently of its position).
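As an illustration of the compression such a foveated sensor performs, below is a minimal sketch of a log-polar retinotopic encoding, assuming NumPy; the grid size, ring spacing, and plain intensity averaging are assumptions made for illustration, not the paper's exact retinal filter bank.

```python
import numpy as np

def log_polar_encode(image, center, n_ecc=10, n_azimuth=24,
                     r_min=2.0, r_max=None):
    """Average pixel intensities over log-spaced eccentricity rings and
    uniform azimuth sectors around `center` (row, col). Resolution is high
    near the fovea and coarse in the periphery, compressing the full field
    into an n_ecc x n_azimuth code."""
    h, w = image.shape
    if r_max is None:
        r_max = max(h, w) / 2
    ys, xs = np.mgrid[0:h, 0:w]
    dy, dx = ys - center[0], xs - center[1]
    r = np.hypot(dy, dx)
    theta = np.arctan2(dy, dx)                  # angle in (-pi, pi]
    # log-spaced ring edges: constant ratio between successive radii
    edges = np.geomspace(r_min, r_max, n_ecc + 1)
    ecc_idx = np.searchsorted(edges, r) - 1     # ring index per pixel
    az_idx = ((theta + np.pi) / (2 * np.pi) * n_azimuth).astype(int) % n_azimuth
    code = np.zeros((n_ecc, n_azimuth))
    count = np.zeros((n_ecc, n_azimuth))
    valid = (ecc_idx >= 0) & (ecc_idx < n_ecc)  # drop sub-foveal/out-of-range pixels
    np.add.at(code, (ecc_idx[valid], az_idx[valid]), image[valid])
    np.add.at(count, (ecc_idx[valid], az_idx[valid]), 1)
    return code / np.maximum(count, 1)          # mean intensity per cell

# A 128x128 image (16384 pixels) is compressed to 10x24 = 240 values,
# roughly a 68-fold reduction; operating on this code rather than on the
# pixel array is what makes sublinear search over the full field plausible.
img = np.random.rand(128, 128)
print(log_polar_encode(img, center=(64, 64)).shape)   # (10, 24)
```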