Segregation of complex sounds such as speech, animal vocalizations, and music emanating simultaneously from multiple sources, referred to as the cocktail party problem, is a remarkable ability common to humans and animals alike. The neural underpinnings of this process have been studied extensively, behaviorally and physiologically, in non-human animals, primarily with simplified sounds (tones and noise sequences). In humans, segregation experiments utilizing more complex speech mixtures are common, but physiological experiments have relied on EEG/MEG/ECoG recordings that sample activity from many thousands of neurons, often obscuring the detailed processes that give rise to the observed segregation. The present study combines the insights attainable from animal single-unit physiology with the segregation of speech-like mixtures. Ferrets were trained to attend to a female voice in a mixture of two simultaneous, equally salient male and female voices. The animals reliably detected a target female word, both when it was presented in the female stream alone and when it was embedded in the male/female voice mixture. Neural representations of the stimuli were recorded from single neurons in the primary and secondary auditory cortical fields of the ferret, as well as in the frontal cortex. During task performance, the representation of the female words became enhanced relative to that of the (distractor) male words in all cortical regions, especially in the higher auditory cortical field. Analysis of the temporal and spectral response characteristics during task performance reveals how speech segregation gradually emerges in the auditory cortex. A computational model evaluated on the same voice mixtures replicates and extends these results to different attentional targets (attention to the female or the male voice). These findings are consistent with the temporal coherence theory, whereby attention to a target voice anchors neural activity in cortical networks, binding together the channels that are coherent with the target and ultimately forming a common auditory stream. The experimental and modeling results shed light on the neural correlates of streaming percepts.
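
The computational model is described here only at a conceptual level; the following Python sketch is merely one way to illustrate the temporal coherence principle, not the authors' implementation. It assumes a precomputed time-frequency representation `mixture` (channels x time frames) and a hypothetical `anchor_ch` index standing in for the attended voice's feature channel; channels whose slow envelopes covary with the anchor are weighted up, approximating the binding of coherent channels into a single stream.

```python
# Illustrative sketch of temporal-coherence-based segregation.
# Assumptions (not from the paper): `mixture` is a [channels, frames]
# time-frequency representation sampled at `frame_rate` Hz, and
# `anchor_ch` is a hypothetical channel tracking the attended voice.
import numpy as np
from scipy.signal import butter, filtfilt

def slow_envelopes(tf, frame_rate, cutoff_hz=8.0):
    """Low-pass each channel to keep the slow (few-Hz) modulations
    over which temporal coherence is typically defined."""
    b, a = butter(2, cutoff_hz / (frame_rate / 2.0), btype="low")
    return filtfilt(b, a, np.abs(tf), axis=1)

def coherence_with_anchor(env, anchor_ch, win=64):
    """Sliding-window normalized correlation of every channel with the
    anchor; values near 1 mark channels coherent with the target."""
    n_ch, n_t = env.shape
    coh = np.zeros_like(env)
    for t in range(0, n_t - win, win // 2):
        seg = env[:, t:t + win]
        seg = seg - seg.mean(axis=1, keepdims=True)
        norm = np.linalg.norm(seg, axis=1) + 1e-12
        c = (seg @ seg[anchor_ch]) / (norm * norm[anchor_ch])
        coh[:, t:t + win] = np.maximum(coh[:, t:t + win], c[:, None])
    return np.clip(coh, 0.0, 1.0)

def segregate(mixture, frame_rate, anchor_ch):
    """Weight the mixture by its coherence with the anchor, so channels
    moving coherently with the attended voice are retained and
    incoherent (distractor) channels are suppressed."""
    env = slow_envelopes(mixture, frame_rate)
    mask = coherence_with_anchor(env, anchor_ch)
    return mask * mixture
```

Switching `anchor_ch` from a female-voice to a male-voice feature channel changes which stream the mask retains, mirroring the model's evaluation under different attentional targets.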