Spoken word recognition requires complex, invariant representations. Using a meta-analytic approach incorporating more than 100 functional imaging experiments, we show that preference for complex sounds emerges in the human auditory ventral stream in a hierarchical fashion, consistent with nonhuman primate electrophysiology. Examining speech sounds, we show that activation associated with the processing of short-timescale patterns (i.e., phonemes) is consistently localized to left mid-superior temporal gyrus (STG), whereas activation associated with the integration of phonemes into temporally complex patterns (i.e., words) is consistently localized to left anterior STG. Further, we show that left mid- to anterior STG is reliably implicated in the invariant representation of phonetic forms and that this area also responds preferentially to phonetic sounds, above artificial control sounds or environmental sounds. Together, this shows increasing encoding specificity and invariance along the auditory ventral stream for temporally complex speech sounds.

functional MRI | meta-analysis | auditory cortex | object recognition | language

Spoken word recognition presents several challenges to the brain. Two key challenges are the assembly of complex auditory representations and the variability of natural speech (SI Appendix, Fig. S1) (1). Representation at the level of primary auditory cortex is precise: fine-grained in scale and local in spectrotemporal space (2, 3). The recognition of complex spectrotemporal forms, like words, in higher areas of auditory cortex requires the transformation of this granular representation into Gestalt-like, object-centered representations. In brief, local features must be bound together to form representations of complex spectrotemporal contours, which are themselves the constituents of auditory "objects" or complex sound patterns (4, 5). Next, representations must be generalized and abstracted.
Coding in primary auditory cortex is sensitive even to minor physical transformations. Object-centered coding in higher areas, however, must be invariant (i.e., tolerant of natural stimulus variation) (6). For example, whereas the phonemic structure of a word is fixed, there is considerable variation in physical, spectrotemporal form (attributable to accent, pronunciation, body size, and the like) among utterances of a given word. It has been proposed for visual cortical processing that a feed-forward, hierarchical architecture (7) may be capable of simultaneously solving the problems of complexity and variability (8-12). Here, we examine these ideas in the context of auditory cortex.

In a hierarchical pattern-recognition scheme (8), coding in the earliest cortical field would reflect the tuning and organization of primary auditory cortex (or core) (2, 3, 13). That is, single-neuron receptive fields (more precisely, frequency-response areas) would be tuned to particular center frequencies and would have minimal spectrotemporal complexity (i.e., a single excitatory zone and one to two inhibitory side bands). ...
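The feed-forward, hierarchical idea referenced above can be made concrete with a minimal computational sketch. The following is not the authors' model or analysis; it is an illustrative toy in the spirit of HMAX-style architectures (cf. refs. 7-12), alternating a template-matching stage (building selectivity for more complex local patterns) with a max-pooling stage (building tolerance to positional variation). All templates, sizes, and signals here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def s_layer(x, templates):
    """Selectivity stage: compare each local patch of the input against a bank
    of stored templates, with Gaussian tuning (response peaks at a match)."""
    width = templates.shape[1]
    n = len(x) - width + 1
    out = np.empty((templates.shape[0], n))
    for i in range(n):
        patch = x[i:i + width]
        out[:, i] = np.exp(-np.sum((templates - patch) ** 2, axis=1))
    return out

def c_layer(s, pool=4):
    """Invariance stage: max-pool responses over position, so the output
    tolerates small local shifts of the underlying pattern."""
    n = s.shape[1] // pool
    return np.array([[s[k, i * pool:(i + 1) * pool].max() for i in range(n)]
                     for k in range(s.shape[0])])

# A toy 1-D "spectral slice" and a slightly shifted copy of it,
# standing in for two physically different utterances of one pattern.
signal = rng.normal(size=64)
shifted = np.roll(signal, 2)

templates = rng.normal(size=(8, 5))  # 8 hypothetical local feature templates

r1 = c_layer(s_layer(signal, templates))
r2 = c_layer(s_layer(shifted, templates))

# The pooled top-level code changes far less under the shift than the raw
# input does: selectivity and invariance are built up in tandem.
top_change = np.abs(r1 - r2).mean()
raw_change = np.abs(signal - shifted).mean()
print(top_change < raw_change)
```

The design choice mirrors the text: the template-matching stage increases spectrotemporal complexity of the preferred stimulus, while pooling discards exact position, yielding object-centered tolerance to variation.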