With the development of single-cell technologies, many cellular traits (e.g. gene expression, chromatin accessibility, DNA methylation) can be measured. Furthermore, multi-omic profiling technologies can measure two or more traits in the same cell simultaneously. To process the rapidly accumulating data, computational methods for multimodal data integration are needed. Previously, we developed inClust, a flexible all-in-one deep generative framework for transcriptome data. Here, we extend the applicability of inClust to multimodal data by adding two mask modules: an input-mask module in front of the encoder and an output-mask module behind the decoder. We call this augmented model inClust+ and apply it to various multimodal datasets. InClust+ was first used to integrate scRNA and MERFISH data from similar cell populations and to impute MERFISH data based on scRNA data. It was then shown to be capable of integrating a multimodal dataset containing scRNA and scATAC data, as well as two multimodal CITE-seq datasets with batch effects. Finally, inClust+ was used to integrate a monomodal scRNA dataset with two multimodal CITE-seq datasets and to generate the missing surface-protein modality for the monomodal scRNA data. In the above examples, the performance of inClust+ is better than or comparable to that of the most recent tools for the corresponding tasks, demonstrating that inClust+ is a suitable framework for handling multimodal data. Meanwhile, the successful implementation of the mask modules in inClust+ suggests that the same strategy can be applied to other deep learning methods with a similar encoder-decoder architecture, broadening the application scope of those models.
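To make the mask-module idea concrete, the sketch below shows a minimal encoder-decoder in PyTorch with an input mask applied before the encoder and an output mask applied behind the decoder. This is an illustrative assumption of how such masking can be wired, not the actual inClust+ implementation; all layer sizes, class names, and feature splits here are hypothetical.

```python
import torch
import torch.nn as nn

class MaskedEncoderDecoder(nn.Module):
    """Minimal sketch of an encoder-decoder with input/output masks,
    in the spirit of inClust+. Architecture details are illustrative."""

    def __init__(self, n_features: int, n_latent: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, n_latent),
        )
        self.decoder = nn.Sequential(
            nn.Linear(n_latent, 128), nn.ReLU(),
            nn.Linear(128, n_features),
        )

    def forward(self, x, input_mask, output_mask):
        # Input mask: zero out features from modalities not measured in
        # this sample, so all modalities share one common input space.
        z = self.encoder(x * input_mask)
        x_hat = self.decoder(z)
        # Output mask: restrict the reconstruction target to measured
        # features; unmasked positions of x_hat serve as imputations.
        return x_hat * output_mask, x_hat


# Usage sketch: cells measured only on RNA, with protein to be imputed
# (feature counts are made up for illustration).
n_rna, n_protein = 2000, 100
model = MaskedEncoderDecoder(n_rna + n_protein)
x = torch.randn(8, n_rna + n_protein)  # batch of 8 cells
rna_only = torch.cat([torch.ones(n_rna), torch.zeros(n_protein)])
masked_recon, full_recon = model(x, rna_only, rna_only)
protein_imputed = full_recon[:, n_rna:]  # missing-modality prediction
```

Under this reading, integration and imputation fall out of the same mechanism: masking aligns heterogeneous inputs in a shared latent space, while the unmasked decoder output provides predictions for the absent modality.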