Deep neural networks have recently become remarkable computational tools for thinking about human visual learning. Recent studies have compared how humans and models respond to altered naturalistic images, providing valuable insights into how both function and how deep neural networks can shape our understanding of human learning. Critically, much of human visual learning happens during early development. Yet well-controlled benchmarks comparing AI models with young humans are scarce. Here, we present a developmentally motivated benchmark of out-of-distribution (OOD) object recognition. Our benchmark, ModelVsBaby, includes a set of OOD conditions that have long been studied in the vision science literature and are expected to be sensitive to the development of OOD object recognition in humans: silhouette, geon, occluded, blurred, and crowded background, along with a baseline realistic condition. Along with the stimuli, we release a unique dataset of the responses of 2-year-old children to the stimuli. Our preliminary analyses of the dataset show several interesting patterns: 2-year-olds achieve 80% accuracy in the silhouette condition, nearly as high as in the realistic condition (chance = 12%). They also perform well above chance, near 60% accuracy, in the other challenging conditions. We also evaluate image-text association (CLIP) models trained on varying amounts of internet-scale data. The models' performance shows that, with enough data, all conditions are learnable by artificial learners. However, the realistic and silhouette conditions are learned with less training data than the other conditions, similar to humans. Our benchmark stimuli and infant responses provide an essential stepping stone for building computational models that are aligned with humans in terms of both learning outcomes and learning trajectory.
This endeavor can support the creation of better models of visual development and improve the efficiency of AI systems for practical applications. Future work may use the benchmark stimuli to test additional age groups and provide a detailed comparison of various kinds of models in terms of “developmental alignment”.