In word learning, learners must identify the referents of words by leveraging the fact that the same word may co‐occur with different sets of objects. This raises the question: what do children remember from each “in the moment” encounter that they can use for cross‐situational learning? Furthermore, do children represent pictures of familiar animals differently from drawings of non‐existent novel objects as potential referents? This study examined these questions by creating learning scenarios with only two potential referents, which requires the least amount of memory to represent all co‐present objects. Across three experiments (n > 250) with 4‐ and 6‐year‐old children, children reliably selected at test the intended referent from learning, though learning was better for novel objects than for familiar objects. When asked for a co‐present object, children of all ages in the study performed at chance in all conditions. We discuss the developmental differences in cross‐situational word learning capabilities with regard to representing different stimuli as potential referents. Importantly, all children used a propose‐but‐verify procedure for learning novel words, even in the simplest of the learning scenarios given repeated exposure.