In face-to-face interaction, speakers spontaneously produce manual gestures that can facilitate listeners' comprehension of spoken language. The present study explores the factors affecting the uptake and influence of gesture cues when a speaker refers to objects visible to the listener. In this context, the listener's attention must be distributed across multiple scene regions, potentially limiting the ability to draw on and apply gesture cues in real time. In two experiments, the instruction provided by a speaker (e.g., "pick up the candy") was accompanied by an iconic grasp gesture (produced alongside the verb) that reflected the size/shape of the intended target. Effects on listeners' comprehension were compared with a no-gesture condition. Experiment 1 (audiovisual gating task) showed that, under simplified processing circumstances, gesture cues allowed earlier identification of intended targets. Experiment 2 (eye tracking) explored whether this facilitation extends to real-time comprehension, and whether attention to gesture information is influenced by the acoustic environment (quiet vs. background noise). Measures of gaze position showed that although the speaker's gesturing hand was rarely fixated directly, gestures did facilitate comprehension, particularly when the target object was smaller relative to the alternatives. The magnitude of the gesture effect was greater in quiet than in noise, suggesting that background noise did not prompt listeners to attend more closely to gesture to compensate for the challenging auditory signal. Together, the findings clarify how situational factors influence listeners' attention to visual information during real-time comprehension.