The formation of categories, which constitutes the basis of concept development, requires multimodal information with a complex structure. We propose a model called the bag of multimodal hierarchical Dirichlet processes (BoMHDP), which enables robots to form a variety of multimodal categories. The BoMHDP model is a collection of a large number of MHDP models, each of which has a different set of weights for sensory information. The weights realize selective attention and enable the formation of various types of categories (e.g., object, haptic, and color). The BoMHDP model is an extension of the HDP, and categorization is unsupervised. However, categories that are not natural for humans are also formed. Therefore, only the significant categories are selected through interaction between the user and the robot. At the same time, words obtained during the interaction are connected to the categories. Finally, the categories that are represented by words are selected. The BoMHDP model was implemented on a robot platform, and a preliminary experiment was conducted to validate it. The results revealed that various categories can be formed with the BoMHDP model. We also analyzed the formed conceptual structure by using multidimensional scaling. The results indicate that the complex conceptual structure was represented reasonably well with the BoMHDP model.
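
The idea of a bag of models that differ only in their modality weights can be illustrated with a minimal sketch. It only enumerates candidate weight sets, one per bag member, and does not implement MHDP inference. The modality names, the binary weight levels, and the function name `make_weight_sets` are assumptions made for illustration and are not taken from the paper.

```python
from itertools import product


def make_weight_sets(modalities, levels=(0, 1)):
    """Enumerate one weight set per bag member (hypothetical helper).

    Each weight set assigns one level to every modality. The all-zero
    set is skipped because a model that attends to nothing is useless.
    """
    weight_sets = []
    for combo in product(levels, repeat=len(modalities)):
        if any(w > 0 for w in combo):
            weight_sets.append(dict(zip(modalities, combo)))
    return weight_sets


# Hypothetical modality names; the actual sensors are not listed in the abstract.
modalities = ("visual", "auditory", "haptic")
bag = make_weight_sets(modalities)

for weights in bag:
    # Each entry would parameterize one MHDP model in the bag.
    print(weights)
```

Under these assumptions, a weight set such as `{"visual": 0, "auditory": 0, "haptic": 1}` corresponds to a model that attends only to haptic information, which is one way a haptic category could emerge.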