Scores of visual attention models have been developed over the past several decades of research. Differences in implementation, assumptions, and evaluation have made these models very difficult to compare. Taxonomies have been constructed to organize and classify the models, but they do not suffice to quantify which classes of models best explain the available data. At the same time, a multitude of physiological and behavioral findings have been published, measuring various aspects of human and non-human primate visual attention. All of these elements highlight the need to integrate the computational models with the data by (1) operationalizing the definitions of visual attention tasks and (2) designing benchmark datasets that measure success on specific tasks under those definitions. In this paper, we provide examples of operationalizing and benchmarking different visual attention tasks, along with the relevant design considerations.
To understand how a human operator performs visual search in complex scenes, it is necessary to account for top-down cognitive biases in addition to bottom-up visual saliency effects. We constructed a model to elucidate the relationship between saliency and cognitive effects in the domain of visual search for distant targets in photo-realistic images of cluttered scenes. In this domain, detecting targets is difficult and requires high visual acuity. Sufficient acuity is available only near the fixation point, i.e., in the fovea; hence, the choice of fixation points is the most important determinant of whether targets are detected. We developed a model that predicts the 2-D distribution of fixation probabilities directly from an image. Fixation probabilities were computed as a function of local contrast (saliency effect) and proximity to the horizon (cognitive effect: distant targets are more likely to be found close to the horizon). For validation, the model's predictions were compared to ensemble statistics of subjects' actual fixation locations, collected with an eye tracker. The model's predictions correlated well with the observed data, and disabling the horizon-proximity component of the model significantly degraded prediction accuracy, demonstrating that cognitive effects must be accounted for when modeling visual search.
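To make the combination of the two effects concrete, the following is a minimal sketch of how such a fixation-probability map might be computed: local luminance standard deviation serves as the saliency term, and an exponential falloff with vertical distance from the horizon serves as the cognitive term. The function name, the multiplicative combination rule, and the parameters contrast_window, horizon_scale, w_saliency, and w_horizon are illustrative assumptions, not the authors' published implementation.

import numpy as np
from scipy.ndimage import uniform_filter

def fixation_probability_map(image, horizon_row,
                             contrast_window=16, horizon_scale=0.05,
                             w_saliency=1.0, w_horizon=1.0):
    """Combine local contrast (saliency) with horizon proximity (cognitive
    bias) into a normalized 2-D fixation-probability map. Parameter values
    are illustrative, not the published model's fitted values."""
    gray = image.mean(axis=2) if image.ndim == 3 else image.astype(float)
    # Saliency term: local luminance standard deviation (local RMS contrast).
    local_mean = uniform_filter(gray, size=contrast_window)
    local_mean_sq = uniform_filter(gray ** 2, size=contrast_window)
    contrast = np.sqrt(np.maximum(local_mean_sq - local_mean ** 2, 0.0))
    # Cognitive term: exponential falloff with vertical distance from the
    # horizon row, so probability mass concentrates near the horizon.
    rows = np.arange(gray.shape[0], dtype=float)[:, None]
    proximity = np.exp(-horizon_scale * np.abs(rows - horizon_row))
    proximity = np.broadcast_to(proximity, gray.shape)
    # Combine multiplicatively (one plausible rule) and normalize so the
    # map sums to 1, i.e. forms a probability distribution over pixels.
    score = (contrast ** w_saliency) * (proximity ** w_horizon)
    return score / score.sum()

Under these assumptions, the ablation described above corresponds to setting w_horizon to 0, which reduces the map to pure contrast saliency.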