I explore the neural and evolutionary origins of phonological hierarchy, building on Peter MacNeilage's frame/content model, which suggests that human speech evolved from nonvocal jaw oscillations in primates (for example, lip smack displays) combined with phonation. Considerable recent data, reviewed here, support this proposition. I argue that the evolution of speech motor control required two independent components. The first, identified by MacNeilage, is the diversification of phonetic “content” within a simple sequential “frame,” which would be within reach of nonhuman primates simply by intermittently activating phonation during lip smack displays. Such control of voicing requires control of the larynx, which is hypothesized to necessitate direct corticomotor connections to the nucleus ambiguus. The second component, proposed here, involves imposing additional hierarchical rhythmic structure upon the “flat” control sequences that typify mammalian vocal tract oscillations, and is required for the flexible combinatorial capacity observed in modern phonology. I hypothesize that phonological hierarchy resulted from a marriage of a preexisting capacity for sequential structure, seen in other primates, with novel hierarchical motor control circuitry (which potentially evolved in the context of tool use and/or music). In turn, this phonological hierarchy paved the way for phrasal syntactic hierarchy. I support these arguments using comparative and neural data from nonhuman primates and birdsong.