“…The disparity between the simplistic, passive learning environment we provided and the rich, multi-modal, and interactive experiences that shape infant learning is pronounced. Efforts to bridge this gap have included capturing infants' sensory experiences through head-mounted cameras (Vong, Wang, Orhan, & Lake, 2024; Emin Orhan, Wang, Wang, Ren, & Lake, 2024; Orhan, Gupta, & Lake, 2020; Sullivan, Mei, Perfors, Wojcik, & Frank, 2021) and eye-tracking (Sheybani, Hansaria, Smith, & Tiganj, n.d.; Mendez, Yu, & Smith, n.d.; Candy et al., 2023), as well as simulating interaction with the environment via embodied agents (Wykowska, Chaminade, & Cheng, 2016). Our benchmark is poised to serve as a critical testing ground for models trained on these datasets.…”