Objectives
In this study, we propose a diagnostic model for automatic detection of otitis media based on combined input of otoscopy images and wideband tympanometry measurements.

Methods
We present a neural network‐based model for the joint prediction of otitis media and diagnostic difficulty. We use the subclassifications acute otitis media and otitis media with effusion. The proposed approach is based on deep metric learning, and we compare this with the performance of a standard multi‐task network.

Results
The proposed deep metric approach shows good performance on both tasks, and we show that the multi‐modal input increases the performance for both classification and difficulty estimation compared to the models trained on the modalities separately. An accuracy of 86.5% is achieved for the classification task, and a Kendall rank correlation coefficient of 0.45 is achieved for difficulty estimation, corresponding to a correct ranking of 72.6% of the cases.

Conclusion
This study demonstrates the strengths of a multi‐modal diagnostic tool using both otoscopy images and wideband tympanometry measurements for the diagnosis of otitis media. Furthermore, we show that deep metric learning improves the performance of the models.
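
The results relate a Kendall rank correlation coefficient of 0.45 to a correct ranking of 72.6% of the cases. The sketch below illustrates how such a conversion can be made, assuming the "correct ranking" figure is the fraction of concordant pairs and that there are no tied pairs. The abstract does not state which variant of the coefficient was used, so this is an illustration of the general relationship and not the authors' evaluation code. The function names and the toy difficulty scores are hypothetical.

```python
from itertools import combinations


def kendall_tau_and_concordance(predicted, reference):
    """Return (tau, fraction of concordant pairs) for two score lists.

    Assumption: tied pairs are ignored, so this matches Kendall's tau-a
    only when neither list contains ties.
    """
    concordant = 0
    discordant = 0
    for (p1, r1), (p2, r2) in combinations(zip(predicted, reference), 2):
        sign = (p1 - p2) * (r1 - r2)
        if sign > 0:
            concordant += 1  # pair ordered the same way in both lists
        elif sign < 0:
            discordant += 1  # pair ordered in opposite ways
    total = concordant + discordant
    tau = (concordant - discordant) / total
    return tau, concordant / total


def concordant_fraction_from_tau(tau):
    """With no ties, tau = 2 * fraction - 1, so fraction = (1 + tau) / 2."""
    return (1 + tau) / 2


# Toy example (hypothetical difficulty scores, not data from the study).
tau, fraction = kendall_tau_and_concordance([1, 2, 3, 4], [1, 2, 4, 3])

# Applying the same relationship to the coefficient quoted in the results.
fraction_reported = concordant_fraction_from_tau(0.45)
```

Under these assumptions a coefficient of 0.45 maps to roughly 72.5% concordant pairs. The small gap to the reported 72.6% would be consistent with the coefficient having been rounded to two decimals, although the abstract does not confirm this.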