Recently, researchers have started examining Android activities to improve user interface (UI) design. Because similar Android activities exhibit similar functional behaviors, activity clustering is a fundamental step toward efficient Android app development. Well-grouped activities are useful not only for UI design but also for app design, development, and testing. However, activity clustering has not yet been studied, and no activity dataset with labels and categories exists. This study uses the Rico dataset to investigate (i) whether the Rico dataset can be used for activity clustering, (ii) how useful the activity attributes expressed in XML are for activity clustering, and (iii) how useful fusing activity images with activity attributes is for activity clustering. We first generate various activity latent vectors from the Rico dataset using a CNN autoencoder. We then produce a sequence-to-sequence (seq2seq) latent vector from the semantic properties of the Rico dataset. Finally, by fusing the two models, we propose an activity clustering approach based on multimodal learning. Because the dataset contains no labels, we create 2,000 labeled samples for evaluation. The experimental results show that activity clustering works well when the semantic activity latent vector and the seq2seq latent vector are fused. In particular, activity attributes such as component and position information are effective for activity clustering and improve performance more than real activity images or Rico. The clustering findings and the newly created labeled data can serve as a starting point for further studies on Android activities.
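To make the fusion step concrete, the following is a minimal sketch of clustering fused latent vectors. It is illustrative only: the abstract does not specify how the two latent vectors are fused or which clustering algorithm is used, so the concatenation-based fusion, the plain k-means routine, and all names and toy values (`fuse`, `kmeans`, `image_latents`, `attr_latents`) are assumptions rather than the method evaluated in this study.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length so neither modality dominates the distance."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(image_vec, attr_vec):
    """Fuse two modalities by concatenation (an assumed fusion scheme)."""
    return l2_normalize(image_vec) + l2_normalize(attr_vec)


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means (assumed clusterer); returns one cluster index per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments


# Hypothetical toy latents standing in for the CNN autoencoder output
# (image modality) and the seq2seq output (attribute modality).
image_latents = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
attr_latents = [[1.0, 0.0, 0.1], [0.9, 0.1, 0.0], [0.0, 1.0, 0.9], [0.1, 0.9, 1.0]]

fused = [fuse(i, a) for i, a in zip(image_latents, attr_latents)]
labels = kmeans(fused, k=2)
```

Normalizing each modality before concatenation is one simple way to keep the image and attribute vectors on a comparable scale; a learned fusion layer would be an equally plausible choice.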