Over the past two decades, 'visually situated' language comprehension (the interplay between language comprehension, attention, and non-linguistic visual context) has emerged as an increasingly active area of research. One important result in this area is that linguistic knowledge, world knowledge, and visual cues can all rapidly inform the unfolding interpretation, as reflected by comprehenders' eye movements to objects during spoken language comprehension. On closer inspection, however, temporal delays in object-directed gaze are not infrequent: they emerge in the processing of non-canonical (vs. canonical) structures, of scalar implicatures, and of recently learned world-language associations. Moreover, while it may be tempting to assume that the different knowledge sources and visual cues are on a par in guiding visual attention, comprehenders' eye movements in many instances reveal a robust referential priority (more looks go to the referent of a word than to other objects). Should this priority be taken as a trivial observation? In the present article, we argue that the tension between this referential priority and other world-language relations constitutes an important constraint on the linking hypotheses and mechanisms implicated in situated language comprehension, and that it should be considered when conceptualizing models and accounts of visually situated language comprehension.