The recognition of iconic correspondence between signal and referent has been argued to bootstrap the acquisition and emergence of language. Here, we study the ontogeny, and to some extent the phylogeny, of the ability to spontaneously relate iconic signals (gestures and/or vocalizations) to previous experience. Children at 18, 24, and 36 months of age (N = 216) and great apes (N = 13) interacted with two apparatuses, each comprising a distinct action and sound. Subsequently, an experimenter mimicked either the action, the sound, or both in combination to refer to one of the apparatuses. Experiments 1 and 2 found no spontaneous comprehension in great apes or in 18‐month‐old children. At 24 months of age, children were successful with a composite vocalization‐gesture signal but not with either vocalization or gesture alone. At 36 months, children succeeded both with a composite vocalization‐gesture signal and with gesture alone, but not with vocalization alone. In general, gestures were understood better than vocalizations. Experiment 4 showed that gestures were understood irrespective of how children learned about the corresponding action (through observation or self‐experience). This pattern of results demonstrates that iconic signals can be a powerful way to establish reference in the absence of language, but they are not trivial for children to comprehend and not all iconic signals are created equal.