Statistical modeling of tongue motion during speech using cine magnetic resonance imaging (MRI) provides key information about the relationship between the structure and motion of the tongue. Studying the variability of tongue shape and motion across a population requires a consistent way to integrate and characterize inter-subject variability. This paper presents a method for constructing a spatio-temporal atlas that comprises a mean motion model and statistical modes of variation during speech. The model is built from cine-MRI data of twenty-two normal speakers in several steps that treat spatial and temporal alignment as separate problems. First, all images are registered into a common reference space, taken to be a neutral resting position of the tongue. Second, the tongue shape of each individual is computed relative to this reference space. Third, a time-warping approach (several are evaluated) aligns the time frames of each subject to a common time series of initial mean images. Finally, the spatio-temporal atlas is created by time-warping each subject, generating new mean images at each time frame, and computing shape statistics around these mean images using principal component analysis at each reference time frame. Experimental results compare various parameters and methods for creating the atlas and demonstrate the final modes of variation at key time frames in a sample phrase.
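
The temporal alignment and per-frame statistics steps can be illustrated with a minimal sketch. It assumes that spatial registration has already been done and that each subject is represented as a sequence of shape vectors (one row per time frame); that representation, the function names (`dtw_path`, `warp_to_reference`, `resample`, `build_atlas`), and all parameters are hypothetical. Classic dynamic time warping is used here only as one possible time-warping choice, since the abstract does not say which approaches were evaluated.

```python
import numpy as np

def dtw_path(a, b):
    """Dynamic time warping between sequences a (Ta, D) and b (Tb, D).
    Returns the optimal list of matched (i, j) frame index pairs."""
    Ta, Tb = len(a), len(b)
    cost = np.full((Ta + 1, Tb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    # Backtrack from the end of both sequences to the start.
    i, j, path = Ta, Tb, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        k = int(np.argmin((cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])))
        if k == 0:
            i, j = i - 1, j - 1
        elif k == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def warp_to_reference(subject, reference):
    """Average the subject frames matched to each reference frame."""
    warped = np.zeros_like(reference, dtype=float)
    counts = np.zeros(len(reference))
    for i, j in dtw_path(subject, reference):
        warped[j] += subject[i]
        counts[j] += 1
    return warped / counts[:, None]

def resample(seq, T):
    """Linearly resample a (T_s, D) sequence to T frames."""
    src = np.linspace(0.0, 1.0, len(seq))
    dst = np.linspace(0.0, 1.0, T)
    return np.stack([np.interp(dst, src, seq[:, d]) for d in range(seq.shape[1])], axis=1)

def build_atlas(subjects, T=10, n_modes=2, n_iter=2):
    """Return the mean sequence, per-frame PCA modes, and per-frame variances."""
    # Initial mean sequence (assumption: linear resampling to T frames).
    mean = np.mean([resample(s, T) for s in subjects], axis=0)
    for _ in range(n_iter):
        # Time-warp every subject to the current mean, then update the mean.
        aligned = np.stack([warp_to_reference(s, mean) for s in subjects])
        mean = aligned.mean(axis=0)
    modes, variances = [], []
    for t in range(T):
        X = aligned[:, t, :] - mean[t]  # centered shapes at frame t
        _, s, vt = np.linalg.svd(X, full_matrices=False)
        modes.append(vt[:n_modes])
        variances.append(s[:n_modes] ** 2 / (len(subjects) - 1))
    return mean, np.stack(modes), np.stack(variances)
```

Calling `build_atlas` on a list of arrays with differing numbers of frames returns a mean sequence of shape `(T, D)`, modes of shape `(T, n_modes, D)`, and the corresponding variances of shape `(T, n_modes)`. Running PCA separately at each reference frame mirrors the per-frame shape statistics described above; a real pipeline would operate on registered images or deformation fields rather than toy vectors.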