Human understanding of things is based on prediction, which is made through concepts formed by categorizing experience. To mimic this mechanism in robots, multimodal categorization, which enables a robot to form concepts, has been studied. Separately, segmentation and categorization of human motions have been studied in order to recognize motions and predict future ones. This paper addresses how these concepts are integrated to generate higher-level concepts and, more importantly, how the higher-level concepts affect the formation of each lower-level concept. To this end, we propose multi-layered multimodal latent Dirichlet allocation (mMLDA) to learn and represent the hierarchical structure of concepts. We also examine a simple integration model and compare it with the mMLDA. The experimental results reveal that the mMLDA leads to better inference performance and does form higher-level concepts, integrating motions and objects, that are necessary for real-world understanding.