Traditional Chinese medicine (TCM) is founded on long-term medical practice in China. Few human brains can fully grasp the deep TCM knowledge distilled from this tremendous body of experience. In the big data era, an electronic brain may become competent through deep learning techniques. To realize this prospect, the electronic brain must process heterogeneous data such as images, texts, audio signals, and other sensory data. Analyzing such heterogeneous data with computer-aided systems remained a challenge until the advent of powerful deep learning tools. We propose a multimodal deep learning framework that mimics a TCM practitioner diagnosing a patient through the multimodal perceptions of seeing, listening, smelling, asking, and touching. The framework learns common representations from various high-dimensional sensory data and fuses the information for final classification. We propose conceptual alignment deep neural networks to embed prior knowledge and obtain interpretable latent representations. We implement a multimodal deep architecture that processes tongue images and description texts for TCM diagnosis. Experiments show that the multimodal deep architecture extracts effective features from heterogeneous data, produces interpretable representations, and achieves higher accuracy than either of the corresponding unimodal architectures.
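As a rough illustration of the fused image-text architecture described above, the following PyTorch sketch shows a two-branch network that encodes a tongue image and a description-text feature vector into a shared latent space, concatenates the two representations, and classifies the result. The encoder layouts, latent dimension, vocabulary size, and class count are placeholder assumptions, and the conceptual alignment mechanism that makes the latent representations interpretable is not reproduced here.

```python
import torch
import torch.nn as nn

class MultimodalTCMClassifier(nn.Module):
    """Illustrative two-branch image + text network with late fusion."""

    def __init__(self, vocab_size=5000, latent_dim=128, num_classes=10):
        super().__init__()
        # Image branch: small CNN over tongue photographs (layout is a placeholder).
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, latent_dim),
        )
        # Text branch: bag-of-words description vector mapped into the same latent space.
        self.text_encoder = nn.Sequential(
            nn.Linear(vocab_size, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Fusion + classifier: concatenate modality representations, predict the diagnosis class.
        self.classifier = nn.Sequential(
            nn.Linear(2 * latent_dim, 64), nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, image, text_bow):
        z_img = self.image_encoder(image)         # (batch, latent_dim)
        z_txt = self.text_encoder(text_bow)       # (batch, latent_dim)
        fused = torch.cat([z_img, z_txt], dim=1)  # joint multimodal representation
        return self.classifier(fused)


# Example forward pass with random stand-in data.
model = MultimodalTCMClassifier()
images = torch.randn(4, 3, 64, 64)   # batch of tongue images
texts = torch.randn(4, 5000)         # batch of description-text feature vectors
logits = model(images, texts)        # shape: (4, num_classes)
```

In this sketch the fusion is a simple concatenation followed by a classifier head; the framework in the paper additionally aligns the latent representations with prior TCM concepts so that they remain interpretable.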