Scene recognition is a core sensory capacity that enables humans to adaptively interact with their environment. Despite substantial progress in the understanding of the neural representations underlying scene recognition, it remains unknown how these representations translate into behavior given different task demands. To address this, we aimed to identify behaviorally relevant scene representations, to characterize them in terms of their underlying visual features, and to reveal how they vary given different tasks. We recorded fMRI data while human participants viewed manmade and natural scenes and linked brain responses to behavior in one of two tasks acquired in a separate set of participants: a manmade/natural categorization task or an orthogonal task on fixation. First, we found correlations between scene categorization response times (RTs) and scene-specific brain responses, quantified as the distance to a hyperplane derived from a multivariate classifier, in occipital and ventral-temporal cortex, but not in parahippocampal cortex. This suggests that representations in early visual and object-selective cortex are relevant for scene categorization. Next, we revealed that mid-level visual features, as quantified using deep convolutional neural networks, best explained the relationship between scene representations and behavior, indicating that these features are read out in scene categorization. Finally, we observed opposite patterns of correlations between brain responses and RTs in the categorization and orthogonal tasks, suggesting a critical influence of task on the behavioral relevance of scene representations. Together, these results reveal the spatial extent, content, and task-dependence of the visual representations that mediate behavior in complex scenes.
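
As an illustration of the brain-behavior link described above, the following minimal sketch computes each scene's distance to a classifier hyperplane and rank-correlates it with RTs. It is a sketch under stated assumptions, not the analysis pipeline of the study: the weight vector `w`, offset `b`, response patterns, and RT values are invented toy numbers, all function names are hypothetical, the use of the absolute distance is one possible choice, and Spearman correlation is assumed here as the correlation measure.

```python
import math

def distance_to_hyperplane(x, w, b):
    """Signed distance of response pattern x from the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

# Toy data (assumed, not from the study): a hyperplane from a hypothetical
# linear classifier separating two categories in a two-voxel response space.
w, b = [1.0, -1.0], 0.0
patterns = [[3, 0], [2, 0], [1, 0], [0, 1], [0, 2], [0, 3]]  # one row per scene
rts_ms = [400, 450, 520, 530, 460, 410]  # invented mean RT per scene

# Absolute distance: how far a scene lies from the boundary, on either side.
distances = [abs(distance_to_hyperplane(x, w, b)) for x in patterns]
rho = spearman(distances, rts_ms)
print(rho)
```

The toy RTs were chosen so that scenes farther from the hyperplane have shorter RTs, which makes the correlation negative by construction; the sign and size of the correlation in real data are empirical questions and are not implied by this example.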