From an early age, humans are challenged with evaluating rich environments full of socially and physically grounded concepts. For example, we might be spectating a rapidly unfolding tennis match, anticipating ball trajectories from players' body cues and goals. In another scenario, we may follow a long storyline, juggling the mental states of characters with varying knowledge of an unfolding conflict. This learning problem is notably complex: it can be multimodal, integrate information at varying timescales, and implicitly co-attend to social and physical scene properties for downstream reasoning. Large vision-language models such as GPT-4V and LLaMA-3, which use vision-language embeddings, show skill in commonsense psychology and physics, but they process only single images. Embeddings from models like CLIP and VisualBERT have been shown to predict responses in high-level visual cortical areas, yet they do not inherently capture video-level representations. This paper introduces a novel video-language architecture that incorporates pooled video embeddings into LLMs by first extracting spatiotemporal embeddings and then mapping them to the model's decoder through a learnable linear layer. We further train the model with video-caption pairs from the ADEPT and AGENT datasets, which are designed to quantify "surprisal" in physical and psychological contexts. Finally, we build separate voxel-wise encoding models for videos involving physics and psychology, using the hidden states and logits from the LLM's last layer as well as pre-projected CLIP embeddings. We find that hidden-state activations explain a remarkably high proportion of variance (R^2 up to ~70%) across dorsal physics regions and highly distributed ventral social-vision areas. Notably, for models trained only on physically surprising stimuli, the hidden states and pre-projected CLIP embeddings explain nearly identical variance across the inferior parietal lobule. However, when the encoding model is trained only on socially surprising events, the hidden states explain far more distributed ventral and dorsal activations than the pre-projected CLIP embeddings.
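
To make the described pipeline concrete, the following is a minimal sketch (not the authors' released code) of the two technical pieces summarized above: temporal pooling of spatiotemporal video embeddings followed by a learnable linear projection into the LLM decoder's embedding space, and a voxel-wise ridge-regression encoding model fit on last-layer hidden states or pre-projected CLIP embeddings. All class, function, and variable names here are hypothetical, and the backbones and dimensions are placeholders.

```python
# Hedged sketch of the pipeline described above; names and shapes are hypothetical.
# Assumes PyTorch and scikit-learn; the video encoder and LLM decoder themselves
# are stand-ins for whatever backbones the paper actually uses.
import torch
import torch.nn as nn
from sklearn.linear_model import Ridge


class VideoToLLMProjector(nn.Module):
    """Pools frame-level (spatiotemporal) embeddings and maps them into the
    LLM decoder's embedding space via a single learnable linear layer."""

    def __init__(self, video_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(video_dim, llm_dim)

    def forward(self, frame_embeddings: torch.Tensor) -> torch.Tensor:
        # frame_embeddings: (batch, n_frames, video_dim), e.g. per-frame CLIP features
        pooled = frame_embeddings.mean(dim=1)   # temporal average pooling
        return self.proj(pooled)                # (batch, llm_dim), fed to the decoder


def fit_voxelwise_encoding(features: torch.Tensor, bold: torch.Tensor, alpha: float = 1.0):
    """Fits one ridge regression per voxel. `features` (n_videos, n_features) are
    LLM last-layer hidden states / logits or pre-projected CLIP embeddings;
    `bold` (n_videos, n_voxels) holds the corresponding fMRI responses."""
    model = Ridge(alpha=alpha)
    model.fit(features.numpy(), bold.numpy())   # sklearn Ridge handles multi-output targets
    return model


if __name__ == "__main__":
    # Toy shapes only; real dimensions depend on the chosen backbones and fMRI data.
    frames = torch.randn(4, 16, 512)            # 4 clips, 16 frames, 512-d frame embeddings
    projector = VideoToLLMProjector(video_dim=512, llm_dim=4096)
    llm_inputs = projector(frames)              # pooled video tokens for the LLM decoder

    fake_hidden_states = torch.randn(4, 4096)   # placeholder for last-layer hidden states
    fake_bold = torch.randn(4, 1000)            # placeholder responses for 1000 voxels
    encoder = fit_voxelwise_encoding(fake_hidden_states, fake_bold)
    print(encoder.coef_.shape)                  # (n_voxels, n_features)
```

The single linear projection is the only newly learned mapping in this sketch; everything upstream (the spatiotemporal embeddings) and downstream (the decoder) is treated as a fixed feature source, consistent with the paper's description of mapping pooled video embeddings into the decoder through one learnable layer.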