Personalized recommender systems, an effective means of alleviating information overload, have received substantial attention over the last decade. Learning effective latent factors plays a central role in recommendation methods. Several recent works extract latent factors from user-generated content such as ratings and reviews, but they suffer from the sparsity problem and the unbalanced distribution problem. To tackle these problems, we enrich the latent representations by incorporating both user-generated content and raw item content. Deep neural networks have proven effective at learning representations in many applications. In this paper, we propose a novel deep neural architecture named DeepFusion that jointly learns user and item representations from numerical ratings, textual reviews, and item metadata. In this framework, we use multiple types of deep neural networks, each suited to one type of heterogeneous input, and introduce an extra layer to obtain the joint representations for users and items. Experiments conducted on the Amazon product data demonstrate that our approach outperforms multiple state-of-the-art baselines. We provide further insight into the design choices and hyperparameters of our recommendation method. In addition, we explore the relative importance of various kinds of item metadata for improving rating prediction performance in personalized product recommendation, which is valuable for feature extraction in practice.