Educational content of many kinds and from many disciplines is increasingly presented in the form of short videos made broadly accessible via platforms such as YouTube. We argue that understanding how such communicative forms function effectively (or not) demands a more thorough theoretical foundation in the principles of multimodal communication, one that is also capable of engaging with, and driving, empirical studies. We introduce the basic concepts adopted and discuss an empirical study showing how functional measures derived from the theory of multimodality we employ align with results from a recipient-based study that we conducted. We situate these results with respect to the state of the art in cognitive research on multimodal learning and argue that the more complex multimodal interactions and artifacts become, the more a fine-grained view of multimodal communication of the kind we propose will be essential for engaging with such media, both theoretically and empirically.