Speech is a continuous stream of sounds. To perceive speech, the listener must segment it, while it is still sounding, into discrete units that differ in frequency, loudness and duration. The purpose of this study was to identify the responses of the human cortex and midbrain to the transition from a consonant to a vowel within a syllable. The study analyzed evoked potentials (EPs) recorded with depth electrodes in 2 patients during intraoperative monitoring (IOM). It compared them with EPs recorded from the scalp in 29 healthy volunteers. In the EPs recorded in response to syllables and vowels, groups of peaks were detected after the onset of the stimulus sound and after the consonant-to-vowel transition. In the patients' EPs, similar groups of short-latency peaks were distinguished: S (from "start"), following stimulus onset, and C (from "change"), following the consonant-to-vowel transition. Their latencies did not differ significantly (p > 0.05). Likewise, similar complexes of long-latency peaks, N1S-P2S and N1C-P2C, were identified in the EPs of healthy volunteers, and their latencies also did not differ significantly (p > 0.05). While the stimulus is sounding, the cortex performs high-level (cognitive) sound processing. The midbrain performs low-level (primary) processing, chiefly ensuring rapid transmission of information to the cortex. In pathologies of the auditory structures of the thalamus and cortex, the ability to respond to changes in sound characteristics as the sound unfolds, including in speech, is likely to be impaired or lost.