A classic puzzle in understanding visual scene perception is how to reconcile the physiological constraints of vision with the phenomenology of seeing. Vision captures information via discrete eye fixations, interrupted by saccadic suppression and limited by retinal inhomogeneity. Yet scenes are effortlessly perceived as coherent, continuous, and meaningful. Two conceptualizations of scene representation will be contrasted. The traditional visual-cognitive model casts visual scene representation as an imperfect reflection of the visual sensory input alone. By contrast, a new multisource model casts visual scene representation in terms of an egocentric spatial framework that is 'filled in' not only by visual sensory input but also by amodal perception and by expectations and constraints derived from rapid scene classification and object-to-context associations. Together, these nonvisual sources serve to 'simulate' a likely surrounding scene that the visual input only partially reveals. Pros and cons of these alternative views will be discussed. WIREs Cogn Sci 2012, 3:117-127. doi: 10.1002/wcs.149