The human brain readily perceives and processes multimodal stimuli. This process, known as crossmodal perceptual integration, has long been in the research spotlight, and a substantial body of work provides evidence that information from different modalities is integrated. Prior research mostly used static pictures and focused on the semantic content of a single sound or word. The present study investigates crossmodal perceptual integration under more realistic conditions, using short movie clips (1500 ms) and meaningful spoken three-word sentences to evaluate target detection in terms of accuracy and response time. Two experimental tasks were developed in PsychoPy, in which participants indicated whether a target (a noun in Experiment 1, a verb in Experiment 2) was present or absent. In trials without a target, target-related information was always present, either through one of the two senses (vision or audition; incongruent condition) or through both senses (congruent condition). Overall, performance was better when the target was absent (M_exp1 = 93.9%, SD_exp1 = 0.04; M_exp2 = 84.2%, SD_exp2 = 0.182) than when it was present (M_exp1 = 82.3%, SD_exp1 = 0.143; M_exp2 = 73.7%, SD_exp2 = 0.184). Moreover, performance was high for incongruent target-related movie clips and decreased significantly for congruent target-related movie clips. In Experiment 1, in the audio condition, participants performed better when the target-related word was a noun than when it was a verb (M = 99.4% vs. M = 86.7%; t(incVerb vs. incNoun) = -8.428, p = .001). In Experiment 2, judgment scores were similarly high for incongruent movie clips and significantly lower for congruent ones, regardless of whether the target-related information was a verb or a noun.
The present results provide evidence on the role of semantic complexity, and in particular on the different roles that verbs and nouns may play in crossmodal perceptual integration in more realistic situations. Our findings can inform the design of learning techniques and of AI models by exploiting the supporting role of semantic audiovisual information, while taking into account the confusion that complex semantic information can introduce into perceptual experience.