Visual narratives communicate event sequences by using different code systems such as pictures and texts. Thus, comprehenders must integrate information from different codalities. This study addressed such cross-codal integration processes by investigating how the codality of bridging-event information (i.e., pictures or text) affects the understanding of visual narrative events. In Experiment 1, bridging-event information was either present (as picture or text) or absent (i.e., not shown). The viewing times for the subsequent picture depicting the end state of the action were comparable in the absent and the text conditions. Further, the viewing times for the end-state picture were significantly longer in the text condition than in the pictorial condition. In Experiment 2, we tested whether replacing bridging-event information with a blank panel increases viewing times in a way similar to the text condition. Bridging-event information was either present (as picture) or absent (not shown vs. blank panel). The results replicated Experiment 1. Additionally, the viewing times for the end-state pictures were longest in the blank condition. In Experiment 3, we investigated the costs related to integrating information from different codalities by directly comparing the text and picture conditions with the blank condition. The results showed that the distortion caused by the blank panel is larger than the distortion caused by cross-codal integration processes. In sum, we conclude that cross-codal information processing during narrative comprehension is possible but associated with additional mental effort. We discuss the results with regard to theories of narrative understanding.