“…Such models are originally trained to predict an image label from a vector representation encoding the pixel-based RGB values of the respective image (see the upper-right part of Figure 2), and have reached impressive levels of performance in this task (Chatfield, Simonyan, Vedaldi, & Zisserman, 2014;Krizhevsky et al, 2012). Furthermore, representations obtained from such models have been validated as measures of visual similarity (Petilli, Günther, Vergallito, Ciaparelli, & Marelli, 2019), and it has been shown that they closely correspond to human intuitions (Bracci, Ritchie, Kalfas, & de Beeck, 2019;Lazaridou, Marelli, & Baroni, 2017;Phillips et al, 2018;Zhang, Isola, Efros, Shechtman, & Wang, 2018).…”