Visual understanding of real-world scenes is near-instantaneous. Humans can extract a wealth of information, including spatial structure, semantic category, and the identity of embedded objects, from images viewed for fewer than 100 msecs. Visual processing has capacity limits, and, as a result, the computational processes that underlie this behaviour must be highly efficient. Computational theories of real-world scene perception model early image processing in various ways. In Chapter 1, I review these theories, and in Chapter 2, I review the role of depth cues in rapid visual processing. This discussion reveals three problems: (i) tests of the agreement between model predictions and human responses may be biased by the arbitrary choice of category system; (ii) current models posit that scene semantics is estimated from spatial structure properties, but empirical support for this position is inconsistent; and (iii) the time-course of depth estimation in real-world scenes is poorly understood. To address these problems, three empirical papers are presented in Chapters 3, 4, and 5. In Chapter 3, I propose and validate a novel clustering algorithm that can be applied to image databases to derive category systems for visual experiments. In Chapters 3 and 4, I examine the relationship between spatial structure and semantic information, and find little support for the position that spatial structure properties inform semantic discrimination. In Chapters 4 and 5, I characterize the time-course of depth processing for images presented for <267 msecs, and conclude that binocular disparity and elevation cues contribute to real-world perception shortly after image onset (<50 msecs). These findings are discussed together in Chapter 6. This thesis contributes to the evaluation of modern models of real-world scene perception, and helps to characterize how visual understanding unfolds over time.