To communicate with their users in a natural and effective manner, humanlike robots must seamlessly integrate behaviors across multiple modalities, including speech, gaze, and gestures. Researchers and designers have successfully drawn on studies of human interaction to build models of humanlike behavior and to achieve such integration in robot behavior. However, developing these models involves a laborious process of inspecting data to identify behavioral patterns within and across modalities and of representing these patterns as "rules" or heuristics for controlling a robot's behaviors, and it offers little support for validation, extensibility, and learning. In this paper, we explore how a learning-based approach to modeling multimodal behaviors might address these limitations. We demonstrate the use of a dynamic Bayesian network (DBN) to model how humans coordinate speech, gaze, and gesture behaviors in narration and to achieve such coordination in robots. An evaluation of this approach in a human-robot interaction study shows that it is comparable to conventional modeling approaches in enabling effective robot behaviors, while reducing the effort required to identify behavioral patterns and providing a probabilistic representation of the dynamics of human behavior. We discuss the implications of this approach for designing natural, effective multimodal robot behaviors.
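As a concrete illustration of the kind of generative structure such a DBN provides, the sketch below unrolls a small, hypothetical two-layer network in which a hidden speech phase evolves across time slices and gaze and gesture behaviors are emitted conditionally on that phase. This is a minimal sketch only; the variable names, states, and probabilities are illustrative assumptions and are not the model learned in this work.

```python
import random

# Hypothetical discrete variables (illustrative assumptions, not drawn
# from the paper's data): a hidden "speech phase" evolves over time,
# and gaze/gesture behaviors are emitted conditionally on that phase.
SPEECH_PHASES = ["reference", "elaboration", "pause"]

# First-order Markov transition model P(phase_t | phase_{t-1}).
TRANSITION = {
    "reference":   {"reference": 0.2, "elaboration": 0.7, "pause": 0.1},
    "elaboration": {"reference": 0.3, "elaboration": 0.5, "pause": 0.2},
    "pause":       {"reference": 0.6, "elaboration": 0.2, "pause": 0.2},
}

# Emission models P(gaze_t | phase_t) and P(gesture_t | phase_t).
GAZE_EMISSION = {
    "reference":   {"at_object": 0.7, "at_listener": 0.2, "averted": 0.1},
    "elaboration": {"at_listener": 0.6, "at_object": 0.2, "averted": 0.2},
    "pause":       {"at_listener": 0.5, "averted": 0.4, "at_object": 0.1},
}
GESTURE_EMISSION = {
    "reference":   {"deictic": 0.7, "beat": 0.2, "none": 0.1},
    "elaboration": {"beat": 0.5, "none": 0.4, "deictic": 0.1},
    "pause":       {"none": 0.8, "beat": 0.1, "deictic": 0.1},
}

def sample(dist):
    """Draw one value from a {value: probability} distribution."""
    r, acc = random.random(), 0.0
    for value, p in dist.items():
        acc += p
        if r <= acc:
            return value
    return value  # guard against floating-point rounding

def generate_behavior(steps=5, phase="pause"):
    """Unroll the DBN for `steps` time slices, yielding coordinated
    (phase, gaze, gesture) tuples a robot controller could execute."""
    sequence = []
    for _ in range(steps):
        phase = sample(TRANSITION[phase])
        sequence.append((phase,
                         sample(GAZE_EMISSION[phase]),
                         sample(GESTURE_EMISSION[phase])))
    return sequence

if __name__ == "__main__":
    for t, (phase, gaze, gesture) in enumerate(generate_behavior()):
        print(f"t={t}: phase={phase}, gaze={gaze}, gesture={gesture}")
```

In a learning-based pipeline, the transition and emission tables above would be estimated from annotated human narration data rather than specified by hand, which is the source of the reduced modeling effort discussed in the paper.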