Lightweight child-worn recorders that collect audio across an entire day allow for a big-data approach to the study of language development. By capturing both the child's own productions and their linguistic environment, these recordings offer a uniquely naturalistic view of everyday language use. However, such recordings quickly accumulate thousands of hours of audio and require the use of automatic speech processing algorithms. Besides providing ecologically valid measures of what children hear and say, these recordings can fuel computational models of early language acquisition with what infants truly hear. This opens up new opportunities for running realistic language learning simulations.

A first aspect of my doctoral work is dedicated to developing automatic speech processing algorithms for child-centered long-form recordings. In this manuscript, I first show that current state-of-the-art automatic speech recognition systems fail to capture the complexity of naturalistic speech as found in long-forms. I then present our attempt to propose a free, open-source, and more accurate alternative to the LENA proprietary software, which is currently the standard tool for obtaining automatic analyses of long-forms. Using supervised learning methods, my collaborators and I built a suite of speech processing tools to detect voice activity, identify voice signal sources (child vocalizations, female or male speech), count the number of linguistic units (phonemes, syllables, or words), and estimate the quantity of background noise and reverberation.

A second aspect of my doctoral work is dedicated to computational models of early language acquisition. I present a first modeling study showing that self-supervised learning algorithms trained on audiobooks can learn phonetic and lexical aspects of their training language.
I then show that the same algorithm trained on ecological long-forms needs inductive biases to learn phonetic aspects of its training language reliably, and I reflect on whether similar inductive biases may guide language learning in infants. Interestingly, there is no evidence for lexical learning on long-forms, contrary to what has been shown in the literature on more curated data. This series of studies illustrates the importance of considering ecologically valid input data when modeling language acquisition.