Background
Deep learning (DL) has been widely used for diagnosis and prognosis prediction of numerous frequently occurring diseases. Generally, DL models require large datasets to produce accurate and reliable prognosis predictions and to avoid overfitting. However, prognosis prediction for rare diseases remains limited because the small number of cases yields only small datasets.

Purpose
This paper proposes a multimodal DL method to predict the prognosis of patients with malignant pleural mesothelioma (MPM) from a small number of 3D positron emission tomography–computed tomography (PET/CT) images and clinical data.

Methods
A 3D convolutional conditional variational autoencoder (3D-CCVAE), which combines 3D convolutional layers with a conditional variational autoencoder (VAE) to process volumetric images, was used for dimensionality reduction of the PET images. We developed a two-step model that performs dimensionality reduction using the 3D-CCVAE and is resistant to overfitting. In the first step, clinical data were supplied as conditioning inputs while the PET images were dimensionally reduced, making the reduction more efficient. In the second step, a subset of the dimensionally reduced features and the clinical data were combined to predict 1-year survival using a random forest classifier. To demonstrate the usefulness of the 3D-CCVAE, we created a model without the conditional mechanism (3D-CVAE), one without the variational mechanism (3D-CCAE), and one without an autoencoder (without AE), and compared their prediction results. We used PET images and clinical data of 520 patients with histologically proven MPM. The data were randomly split in a 2:1 ratio (train:test), and three-fold cross-validation was performed. The models were trained on the training set and evaluated on the test set.
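The two-step scheme above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the latent dimension, the linear stand-in for the 3D convolutional encoder, and all weights here are hypothetical, and only the conditional concatenation and the VAE reparameterization step are shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(image_feats, clinical, W_mu, W_logvar):
    # Conditional encoding: clinical data are concatenated with the image
    # features, so the latent code need not re-encode clinical information.
    # (A linear map stands in for the 3D convolutional encoder.)
    x = np.concatenate([image_feats, clinical])
    return W_mu @ x, W_logvar @ x

def reparameterize(mu, logvar, rng):
    # VAE reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

# Hypothetical toy dimensions: 32 image features, 4 clinical variables, 8 latent dims
d_img, d_clin, d_z = 32, 4, 8
W_mu = rng.standard_normal((d_z, d_img + d_clin)) * 0.1
W_logvar = rng.standard_normal((d_z, d_img + d_clin)) * 0.1

img = rng.standard_normal(d_img)    # stand-in for an encoded 3D PET volume
clin = rng.standard_normal(d_clin)  # stand-in for clinical covariates

# Step 1: conditional dimensionality reduction of the image
mu, logvar = encode(img, clin, W_mu, W_logvar)
z = reparameterize(mu, logvar, rng)

# Step 2: latent features and clinical data are combined; in the paper these
# feed a random forest classifier for 1-year survival prediction.
features = np.concatenate([z, clin])
print(features.shape)  # (12,)
```

In the actual model the encoder is trained jointly with a decoder under the VAE objective; the sketch only shows how the conditioning input and the sampled latent code fit together.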
The area under the receiver operating characteristic curve (AUC) was calculated for each model from its 1-year survival predictions, and the results were compared.

Results
We obtained AUC values of 0.76 (95% confidence interval [CI], 0.72–0.80) for the 3D-CCVAE model, 0.72 (95% CI, 0.68–0.77) for the 3D-CVAE model, 0.70 (95% CI, 0.66–0.75) for the 3D-CCAE model, and 0.69 (95% CI, 0.65–0.74) for the without AE model. The 3D-CCVAE model performed better than the other models (3D-CVAE, p = 0.039; 3D-CCAE, p = 0.0032; and without AE, p = 0.0011).

Conclusions
This study demonstrates the usefulness of the 3D-CCVAE in multimodal DL models trained on small datasets. It also shows that dimensionality reduction via an AE allows a DL model to be trained without increasing the risk of overfitting. Moreover, the VAE mechanism can mitigate the uncertainty in model parameters that commonly arises with small datasets, thereby reducing the risk of overfitting. Finally, more efficient dimensionality reduction of PET images can be achieved by providing clinical data as conditions, allowing the encoder to ignore clinical-data-related features.
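The evaluation metric above, AUC with a 95% CI, can be computed as follows. This is a generic sketch, not the paper's code: the AUC is the probability that a random positive outranks a random negative, and the CI here uses a percentile bootstrap (the paper does not state its CI method, so this is an assumption); the labels and scores are synthetic.

```python
import numpy as np

def auc(y_true, scores):
    # AUC = P(score of random positive > score of random negative),
    # counting ties as 1/2 (equivalent to the Mann-Whitney U statistic).
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def bootstrap_ci(y_true, scores, n_boot=2000, alpha=0.05, seed=0):
    # Percentile bootstrap CI for the AUC (one common choice; assumption here).
    rng = np.random.default_rng(seed)
    stats = []
    n = len(y_true)
    while len(stats) < n_boot:
        idx = rng.integers(0, n, n)
        yb = y_true[idx]
        if yb.min() == yb.max():  # resample must contain both classes
            continue
        stats.append(auc(yb, scores[idx]))
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

# Synthetic 1-year survival labels and predicted probabilities
rng = np.random.default_rng(1)
y = rng.integers(0, 2, 200)
s = np.clip(y * 0.3 + rng.normal(0.4, 0.25, 200), 0.0, 1.0)

point = auc(y, s)
lo, hi = bootstrap_ci(y, s)
print(f"AUC {point:.2f} (95% CI, {lo:.2f}-{hi:.2f})")
```

The between-model p-values reported in the Results would additionally require a paired test on correlated AUCs (e.g., DeLong's test), which is beyond this sketch.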