Introduction: Decision-making in medicine is characterized by information overload and complexity, conditions under which information fusion techniques are effective. For the diagnosis and treatment of pneumonia using X-ray images and the accompanying free-text radiologists' reports, text-image fusion is a promising approach.

Purpose: To develop a neural-network method for fusing text with images in the diagnosis and treatment of pneumonia.

Methods: We used the MIMIC-CXR dataset, SEResNeXt101-32x4d for image feature extraction, and the Bio-ClinicalBERT model followed by a ContextLSTM layer for text feature extraction. We compared five architectures: an image classifier, a report classifier, and three fusion scenarios, namely late fusion, middle fusion, and early fusion.

Results: The early-fusion classifier achieved ROC AUC = 0.9933 and PR AUC = 0.9907, exceeding even the idealized text classifier (ROC AUC = 0.9921, PR AUC = 0.9889), i.e., the case that does not account for possible radiologist errors. Network training time ranged from 20 minutes for late fusion to 9 hours and 45 minutes for early fusion. Using the Class Activation Map technique, we showed graphically that the image feature extractor in the fused classification scenario still learns discriminative regions for the pneumonia classification problem.

Discussion: Fusing text and images increases the likelihood of correct classification compared to image-only classification. The proposed combined image-report classifier trained with early fusion outperforms the individual classifiers on the pneumonia classification problem. However, the better results come at the cost of training time and computational resources; report-based training is much faster and less demanding of computational capacity.
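The sketch below illustrates how the early-fusion scenario described above could be wired in PyTorch; it is not the authors' implementation. The feature dimensions (2048 pooled features from SEResNeXt101-32x4d, 768-dimensional Bio-ClinicalBERT hidden states), the single bidirectional LSTM standing in for the ContextLSTM layer, the classifier head sizes, and the HuggingFace-style text-encoder interface (`last_hidden_state`) are all assumptions for illustration.

```python
# Minimal early-fusion sketch (not the authors' code): image and report
# features are concatenated into one joint vector and classified together,
# with both encoders trainable end to end.
import torch
import torch.nn as nn


class EarlyFusionClassifier(nn.Module):
    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module,
                 img_dim: int = 2048, txt_dim: int = 768, lstm_dim: int = 256):
        super().__init__()
        self.image_encoder = image_encoder   # e.g. SEResNeXt101-32x4d without its head
        self.text_encoder = text_encoder     # e.g. Bio-ClinicalBERT (HuggingFace-style output assumed)
        # Recurrent layer over BERT token embeddings, a stand-in for the
        # ContextLSTM layer mentioned in the abstract.
        self.context_lstm = nn.LSTM(txt_dim, lstm_dim, batch_first=True,
                                    bidirectional=True)
        self.head = nn.Sequential(
            nn.Linear(img_dim + 2 * lstm_dim, 512),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, 1),               # binary pneumonia logit
        )

    def forward(self, image, input_ids, attention_mask):
        img_feat = self.image_encoder(image)                         # (B, img_dim)
        tok_emb = self.text_encoder(input_ids=input_ids,
                                    attention_mask=attention_mask
                                    ).last_hidden_state              # (B, T, txt_dim)
        _, (h_n, _) = self.context_lstm(tok_emb)
        txt_feat = torch.cat([h_n[-2], h_n[-1]], dim=-1)             # (B, 2*lstm_dim)
        fused = torch.cat([img_feat, txt_feat], dim=-1)              # joint image+text vector
        return self.head(fused)


# Example wiring (assumed libraries and model names; exact timm identifier
# may differ between versions):
# import timm, transformers
# image_encoder = timm.create_model("seresnext101_32x4d", pretrained=True, num_classes=0)
# text_encoder = transformers.AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
# model = EarlyFusionClassifier(image_encoder, text_encoder)
```

By contrast, a late-fusion setup would combine the outputs of separately trained image and report classifiers, which is why it trains in minutes, whereas training both encoders jointly on the fused representation accounts for the much longer early-fusion training time reported above.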