Learning from a text–picture multimedia document is particularly effective if learners can link information within the text and across the verbal and pictorial representations. The ability to build a mental model that includes those implicit links is related to the ability to generate inferences. Text processing research has found that text cohesion facilitates inference generation, and thus text comprehension, for learners with poor prior knowledge or reading abilities, but is detrimental for learners with good prior knowledge or reading abilities. Moreover, multimedia research has found a positive effect of adding visual representations to text information, particularly when implementing signaling, which consists of verbal or visual cues designed to guide attention to the pictorial representation of relevant information. We expected that, as with text-only documents, struggling readers would benefit from high text cohesion (Hypothesis 1) and that signaling would foster inference generation as well (Hypothesis 2). Further, we hypothesized that better learning outcomes would be observed when text cohesion was low and signaling was present (Hypothesis 3). Our first experimental study investigated the effect of those two factors (cohesion and signaling) on three levels of comprehension (text-based, local inferences, global inferences). Participants were adolescents in prevocational schools (n = 95), where some of the students are struggling readers. The results showed a nonsignificant trend in favor of high cohesion, a significant positive effect of cross-representational signaling (CRS) on local-inference comprehension, and no interaction effect. A second experiment focused on signaling only and on attention toward the picture, collecting eye-tracking data in addition to measures of offline comprehension.
Because this second experiment was conducted with university students (n = 47), who are expected to have higher reading abilities and are thus less likely to benefit from high cohesion, the material was presented in its low-cohesion version. The results showed no effect of condition on comprehension performance but confirmed differences in processing behavior: participants allocated more attention to the pictorial representation in the CRS condition than in the no-signaling condition.