Vocalization is an essential medium for social and sexual signaling in most birds and mammals. Consequently, the analysis of vocal behavior is of great interest to fields such as neuroscience and linguistics. A standard approach to analyzing vocalization involves segmenting the sound stream into discrete vocal elements, calculating a number of handpicked acoustic features, and then using the feature values for subsequent quantitative analysis. While this approach has proven powerful, it suffers from several crucial limitations: First, handpicked acoustic features may miss dimensions of variability that are important for communicative function. Second, many analyses assume that vocalizations fall into discrete categories, often without rigorous justification. Third, syllable-level analysis requires a consistent definition of syllable boundaries, which is often difficult to maintain in practice and limits the sorts of structure one can find in the data. To address these shortcomings, we apply a data-driven approach based on the variational autoencoder (VAE), an unsupervised learning method, to the task of characterizing vocalizations in two model species: the laboratory mouse (Mus musculus) and the zebra finch (Taeniopygia guttata). We find that the VAE converges on a parsimonious representation of vocal behavior that outperforms handpicked acoustic features on a variety of common analysis tasks, including representing acoustic similarity and recovering a known effect of social context on birdsong. Additionally, we use our learned acoustic features to argue against the widespread view that mouse ultrasonic vocalizations form discrete syllable categories. Lastly, we present a novel "shotgun VAE" that can quantify moment-by-moment variability in vocalizations. In all, we show that data-derived acoustic features confirm and extend existing approaches while offering distinct advantages in several critical applications.

1 Introduction

Vocalization is an essential medium for social and sexual signaling in most birds and mammals, and also serves as a natural substrate for language and music in humans. Consequently, the analysis of vocal behavior is of great interest to ethologists, psychologists, linguists, and neuroscientists. A major goal of these various lines of enquiry is the quantitative analysis of vocal behavior, an effort that has produced several powerful methods for the automatic or semi-automatic analysis of vocalizations. Key to this approach has been the existence of software packages that calculate acoustic features for each syllable within a vocalization [4, 39, 40, 7, 6]. For example, Sound Analysis Pro, focused on birdsong, calculates 14 features for each syllable, including duration, spectral entropy, and goodness of pitch, and uses these as a basis for subsequent clustering and analysis [39]. More recently, MUPET and DeepSqueak have applied a similar approach to mouse vocalizations.
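To make the handpicked-feature baseline concrete, the following is a minimal sketch of how two of the per-syllable features named above, duration and spectral entropy, can be computed from a waveform with known syllable boundaries. It illustrates the general approach only, not the implementation used by Sound Analysis Pro, MUPET, or DeepSqueak; the function name and STFT parameters here are our own choices.

```python
# Illustrative per-syllable feature extraction (duration, spectral entropy).
# Not the Sound Analysis Pro implementation; parameters are assumptions.
import numpy as np
from scipy.signal import spectrogram

def syllable_features(audio, fs, onset, offset):
    """Return (duration in s, mean spectral entropy) for one syllable.

    `audio` is a 1-D waveform, `fs` its sample rate in Hz, and
    `onset`/`offset` are the syllable boundaries in seconds.
    """
    segment = audio[int(onset * fs):int(offset * fs)]
    duration = offset - onset
    # Short-time power spectrum of the syllable.
    _, _, sxx = spectrogram(segment, fs=fs, nperseg=256, noverlap=128)
    # Normalize each time bin into a probability distribution over frequency.
    p = sxx / (sxx.sum(axis=0, keepdims=True) + 1e-12)
    # Entropy per time bin, averaged over the syllable: low for tonal,
    # whistle-like sounds, high for broadband noise.
    entropy = float(-(p * np.log2(p + 1e-12)).sum(axis=0).mean())
    return duration, entropy
```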
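By contrast, the VAE approach replaces such hand-chosen summaries with features learned directly from the data. The PyTorch sketch below shows the core idea under simple assumptions (syllable spectrograms flattened to 128x128-pixel vectors, a 32-dimensional latent space, fully connected layers); the actual architecture used in this work may differ. After training, the per-syllable latent means can serve as learned acoustic features in place of handpicked ones.

```python
# Minimal VAE sketch for syllable spectrograms; sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectrogramVAE(nn.Module):
    def __init__(self, n_pixels=128 * 128, latent_dim=32):
        super().__init__()
        self.enc = nn.Linear(n_pixels, 256)
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, n_pixels)
        )

    def forward(self, x):  # x: (batch, n_pixels), values in [0, 1]
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample z while keeping gradients.
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        return torch.sigmoid(self.dec(z)), mu, logvar

def vae_loss(recon, x, mu, logvar):
    # Reconstruction error plus KL divergence to the standard normal prior.
    recon_err = F.binary_cross_entropy(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_err + kl

# Usage: encode a batch of flattened spectrograms and compute the loss.
model = SpectrogramVAE()
x = torch.rand(8, 128 * 128)
recon, mu, logvar = model(x)
loss = vae_loss(recon, x, mu, logvar)
```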