Children live in a multimodal world. For example, communication with young children includes information not only from the auditory linguistic modality, in the form of speech, but also from the visual modality, in the form of actions that caregivers use when interacting with children. Dynamic systems approaches suggest that multimodal input can help children to learn from the environment while also allowing them to shape their own learning experience through selective attention. This selective attention might be influenced by the child's preferences, which, in turn, might shape the child's learning behaviour. In this study, we investigated how children's selective attention to information from the linguistic or the action modality influences learning in both domains. Two- to 3-year-old children and adults participated in a novel gaze-contingent paradigm that allowed them to choose between being provided with the labels for novel and familiar objects or the actions that one can perform on them. At test, participants saw the two novel objects and either heard one of the labels or saw one of the actions that had been performed on one of the objects. Following label and action presentation, we investigated whether participants fixated the target object, i.e., the object whose respective action or label had been presented, as an index of word and action learning. Children learned word-object but not action-object associations, and their target looking in the word-object condition was influenced by their selective attention to words in the earlier phase. Adults learned both word-object and action-object associations, and their target looking in the action-object condition was influenced by their selective attention to actions in the earlier phase. Gaze-contingent eye-tracking paradigms provide a unique method for analysing children's active learning preferences, which will help us better understand children's learning behaviour in a complex world.
In particular, we show that in multimodal environments, children's preferences might help to structure the complex input into chunks that are compatible with the child's cognitive capacities in the moment.