A standing conversational group (also known as an F-formation) occurs when two or more people sustain a social interaction, such as chatting at a cocktail party. Detecting such interactions in images or videos is of fundamental importance in many contexts, such as surveillance, social signal processing, social robotics, and activity classification. This paper presents an approach to this problem that models the socio-psychological concept of an F-formation together with the biological constraints of social attention. Essentially, an F-formation defines constraints on how subjects must be mutually located and oriented, while the biological constraints define the plausible zone in which people can interact. We develop a game-theoretic framework embedding these constraints, supported by a statistical model of the uncertainty associated with the position and orientation of people. First, we use a novel representation of the affinity between pairs of people, expressed as a distance between distributions over the most plausible oriented region of attention. Additionally, we integrate temporal information over multiple frames to smooth noisy head orientation and pose estimates, resolve ambiguous situations, and establish a more precise social context. We do this in a principled way by using recent notions from multi-payoff evolutionary game theory. Experiments on several benchmark datasets consistently show the superiority of the proposed approach over the state of the art and its robustness under severe noise conditions.

The author has been partially supported by the European Commission under contract number FP7-ICT-600877 (SPENCER) and is affiliated with the Delft Data Science consortium.