Understanding how people comprehend visual narratives (including picture stories, comics, and film) requires combining traditionally separate theories that span the initial sensory and perceptual processing of complex visual scenes, the perception of events over time, and the comprehension of narratives. Existing piecemeal approaches fail to capture the interplay between these levels of processing. Here, we propose the Scene Perception & Event Comprehension Theory (SPECT), as applied to visual narratives, which distinguishes between front‐end and back‐end cognitive processes. Front‐end processes occur during single eye fixations and comprise attentional selection and information extraction. Back‐end processes occur across multiple fixations and support the construction of event models, which reflect understanding of what is happening now in a narrative (stored in working memory) and over the course of the entire narrative (stored in long‐term episodic memory). We describe relationships between front‐end and back‐end processes, as well as medium‐specific differences that likely produce variation in these processes across media (e.g., picture stories vs. film). We describe several novel research questions derived from SPECT that we have explored. By addressing these questions, we provide greater insight into how attention, information extraction, and event model processes are dynamically coordinated to perceive and understand complex naturalistic visual events in narratives and the real world.