Humans effortlessly recognize social interactions from visual input. The best models of this ability are generative inverse planning models, which make predictions by simulating agents' inferred goals, suggesting that humans use a similar process of mental simulation. However, growing behavioral and neuroscience evidence suggests that recognizing social interactions is a visual process, separate from complex mental simulation. Yet despite their success in other domains, visual neural network models have been unable to reproduce human-like interaction recognition. We hypothesize that humans rely specifically on relational visual information, which standard neural networks lack, and develop a new relational graph neural network model, SocialGNN. Unlike prior models, SocialGNN accurately predicts human interaction judgments across both animated and natural videos. These results suggest that humans can make complex social interaction judgments without explicit simulation or inference about agents' mental states, and that structured, relational visual representations are key to this behavior.