Multimodal imitation of actions, gestures and vocal production is a hallmark of the evolution of human communication, as both vocal learning and visual-gestural imitation were crucial factors that facilitated the evolution of speech and singing. Comparative evidence has revealed that humans are an unusual case in this respect, as multimodal imitation is barely documented in non-human animals. While there is evidence of vocal learning in birds and in mammals such as bats, elephants and marine mammals, evidence in both domains, vocal and gestural, exists only for two psittacine birds (budgerigars and grey parrots) and cetaceans. Moreover, the comparative evidence draws attention to the apparent absence of vocal imitation in monkeys and apes in the wild (with only a few reported cases of vocal fold control in an orangutan and a gorilla, and a prolonged development of vocal plasticity in marmosets), and even of imitation of intransitive (not object-related) actions. Even after training, the evidence for productive or “true” imitation (copying of a novel behavior, i.e., one not pre-existent in the observer’s behavioral repertoire) in both domains is scarce. Here we review the evidence of multimodal imitation in cetaceans, one of the few living mammalian groups besides humans reported to display multimodal imitative learning, and its role in sociality, communication and group cultures. We propose that cetacean multimodal imitation was acquired in parallel with the evolution and development of behavioral synchrony and multimodal organization of sensorimotor information, supporting volitional motor control of their vocal system and the integration of audio-echoic-visual voices, body posture and movement.