Theories of attention argue that objects are the units of attentional selection. In real-world environments, such objects can contain both visual and auditory features. To understand how mechanisms of selective attention operate in multisensory environments, we created an audiovisual cocktail-party situation in which two speakers (left and right of fixation) simultaneously articulated brief numerals. In three separate blocks, informative auditory speech was presented (a) alone or paired with (b) congruent or (c) uninformative visual speech. In all blocks, subjects localized a pre-defined numeral. While congruent and uninformative visual speech both improved response times and the speed of information uptake according to diffusion modeling, an ERP analysis revealed that this did not coincide with enhanced attentional engagement. Yet, consistent with object-based attentional selection, the deployment of auditory spatial attention (N2ac) was accompanied by visuo-spatial attentional orienting (N2pc) irrespective of the informational content of visual speech. Notably, the N2pc component was absent in the auditory-only condition, demonstrating that a sound-induced shift of visuo-spatial attention relies on the availability of audiovisual features evolving coherently in time. Additional analyses revealed cross-modal interactions in working memory and modulations of cognitive control. The preregistered methods and hypotheses of this study can be found at https://osf.io/vh38g.