This study investigated the usefulness of deep learning-based automatic detection of temporomandibular joint (TMJ) effusion on magnetic resonance imaging (MRI) in patients with temporomandibular joint disorder (TMD), and whether the model’s diagnostic accuracy improved when patients’ clinical information was provided in addition to the MR images. Sagittal MR images of 2,948 TMJs were collected from 1,017 women and 457 men (mean age 37.19 ± 18.64 years). The TMJ effusion diagnostic performance of three convolutional neural networks (from-scratch, fine-tuning, and freeze schemes) was compared with that of human experts based on the area under the curve (AUC) and diagnostic accuracy. The fine-tuning model with proton density (PD) images showed acceptable prediction performance (AUC = 0.7895), whereas the from-scratch (0.6193) and freeze (0.6149) models performed significantly worse (p < 0.05). The fine-tuning model had higher specificity than the human experts (87.25% vs. 58.17%), whereas the human experts had higher sensitivity (80.00% vs. 57.43%) (both p < 0.001). In Grad-CAM visualizations, the fine-tuning scheme focused more on the effusion than on other structures of the TMJ, and its sparsity was higher than that of the from-scratch scheme (82.40% vs. 49.83%, p < 0.05). The Grad-CAM visualizations indicated that the model had learned important features in the TMJ area, particularly around the articular disc. Two fine-tuning models trained on PD and T2-weighted images did not improve diagnostic performance compared with using PD alone (p < 0.05). AUCs varied across subgroups when the patients were divided according to age (0.7083–0.8375) and sex (male: 0.7576, female: 0.7083). The prediction accuracy of the ensemble model was higher than that of the human experts when all the data were used (74.21% vs. 67.71%, p < 0.05). A deep neural network (DNN) was developed to process multimodal data, including MRI and patient clinical data.
In the analysis of four age groups with the DNN model, the 41–60 age group showed the best performance (AUC = 0.8258). There was no significant difference between the prediction performance of the fine-tuning model and that of the DNN (p > 0.05). The fine-tuning model and the DNN performed best for detecting TMJ effusion and may help reduce misdiagnosed cases and support human diagnostic performance. Assistive automated diagnostic methods have the potential to increase clinicians’ diagnostic accuracy.
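
For readers less familiar with the metrics reported above, the following minimal sketch shows how sensitivity, specificity, accuracy, and AUC can be computed from binary labels and model scores. It is an illustration only, not the evaluation code used in this study: the toy labels, the scores, the 0.5 decision threshold, and all function names are hypothetical assumptions introduced here. The AUC is computed with the standard rank-based (Mann-Whitney) formulation.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels, where 1 = effusion present."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true effusion cases that were detected."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of effusion-free cases that were correctly ruled out."""
    return tn / (tn + fp)


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of all cases classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Rank-based AUC: the probability that a randomly chosen positive case
    scores higher than a randomly chosen negative case (ties count as 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, not taken from the study.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 1]
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
threshold = 0.5  # assumed decision threshold
toy_preds = [1 if s >= threshold else 0 for s in toy_scores]

tp, fp, tn, fn = confusion_counts(toy_labels, toy_preds)
print("sensitivity:", sensitivity(tp, fn))
print("specificity:", specificity(tn, fp))
print("accuracy:", accuracy(tp, fp, tn, fn))
print("AUC:", auc(toy_labels, toy_scores))
```

Sensitivity, specificity, and accuracy depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds. This is why a model can show higher specificity but lower sensitivity than human readers while still achieving a reasonable AUC.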