Language learners track conditional probabilities to find words in continuous speech and to map words to objects across ambiguous contexts. It remains unclear, however, whether learners can leverage the structure of the linguistic input to perform both tasks simultaneously, and how doing so affects learning. To explore these questions, we combined speech segmentation and cross-situational word learning into a single task. Participants had to track speech statistics (transitional and phonotactic probabilities) to segment words and, at the same time, track co-occurrences between these newly segmented words and objects across presentations to overcome ambiguity and learn word-object pairings. In Experiment 1, when adults (N = 60) simultaneously segmented continuous speech and mapped the newly segmented words to objects, they performed better than when either task was performed alone. However, when the speech stream contained conflicting transitional and phonotactic cues, participants were still able to correctly map words to objects but, surprisingly, performed at chance on speech segmentation. In Experiment 2, we used a more sensitive speech segmentation measure and found that adults (N = 35) exposed to the same conflicting speech stream correctly rejected non-words but were still unable to consistently discriminate words from part-words. Mapping performance was again above chance. Our study suggests that learners can track multiple sources of statistical information to find words and map them to objects in complex environments. It also raises critical questions about how to effectively measure the knowledge that may arise from these learning experiences.
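
As a rough illustration of the transitional-probability statistic mentioned above (not the experimental materials or analysis code), the sketch below estimates forward transitional probabilities, P(next syllable | current syllable), from a toy syllable stream; the syllables and "words" in the example are hypothetical.

```python
from collections import Counter, defaultdict

def transitional_probabilities(syllables):
    """Estimate forward TPs: P(next | current) = count(current, next) / count(current)."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    first_counts = Counter(syllables[:-1])
    tps = defaultdict(dict)
    for (a, b), n in pair_counts.items():
        tps[a][b] = n / first_counts[a]
    return tps

# Toy stream: two hypothetical trisyllabic "words" (tu-pi-ro, go-la-bu) concatenated without pauses.
stream = "tu pi ro go la bu tu pi ro tu pi ro go la bu go la bu".split()
tps = transitional_probabilities(stream)
print(tps["tu"]["pi"])  # within-word TP is high (1.0 in this toy stream)
print(tps["ro"])        # TPs out of a word-final syllable are split and lower, marking a likely boundary
```

In segmentation accounts of this kind, dips in transitional probability are taken as candidate word boundaries, which is the statistic the conflicting-cue stream in the experiments pits against phonotactic regularities.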