With the explosive growth of online information, users increasingly face information overload. Recommender systems have become an effective way to handle this problem: they analyze the characteristics of users and items to provide valuable information. One important type of information is an item’s side information. For example, in the Amazon dataset, side information mainly includes visual side information (e.g., images and videos), textual side information (e.g., titles and descriptions), and auxiliary side information (e.g., brand and category). To exploit these various types of side information, some studies model each type as a separate modality, which improves the performance of the recommender system. To capture deeper relationships between users and items, recent works also represent interactions with a graph structure. However, existing multi-modal recommender systems based on graph neural networks rely largely on interaction records, while little effort has been devoted to the relationships between interactions and the various types of side information. In this paper, we propose a novel multi-task learning model. We first construct an interaction graph for each modality to gather user and item representations, and then analyze the relationships between each modality’s representations and the corresponding side information based on their similarities. Specifically, we design a Multi-task Multi-modal Graph Neural Network (MTMM-GNN) framework built on attention-based message passing in graph neural networks, which generates user and item representations from interaction records and then analyzes the relationships between these GNN representations and the items’ side information. We conduct experiments on two public datasets, Amazon and MovieLens, and the results show that our model outperforms state-of-the-art methods.
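To make the attention-based message passing mentioned above concrete, the following is a minimal GAT-style sketch of how user and item representations could be aggregated over an interaction graph for one modality. It is an illustrative assumption, not the paper's actual MTMM-GNN implementation; the function name, tensor shapes, and parameters are hypothetical.

```python
import torch
import torch.nn.functional as F

def attention_message_passing(node_feats, edge_index, W, a):
    """One layer of attention-weighted message passing (illustrative sketch).

    node_feats: (N, d_in) features of all nodes (users and items stacked).
    edge_index: (2, E) source/target node indices of interaction edges.
    W: (d_in, d_out) linear projection; a: (2 * d_out,) attention vector.
    """
    h = node_feats @ W                                  # project node features
    src, dst = edge_index                               # messages flow src -> dst
    # unnormalized attention logits for each edge
    e = F.leaky_relu(torch.cat([h[src], h[dst]], dim=-1) @ a)
    # softmax over the incoming edges of each destination node
    e = e - e.max()                                     # numerical stability
    num = torch.exp(e)
    denom = torch.zeros(h.size(0)).index_add_(0, dst, num) + 1e-16
    alpha = num / denom[dst]
    # aggregate attention-weighted neighbor messages per destination node
    out = torch.zeros_like(h).index_add_(0, dst, alpha.unsqueeze(-1) * h[src])
    return out
```

Running such a layer once per modality graph yields modality-specific user and item representations, which could then be compared with the corresponding side-information embeddings (e.g., via cosine similarity) in the multi-task objective described in the abstract.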