Location-based social networks (LBSNs) have greatly advanced the field of human mobility mining. However, the sparse, multimodal, and heterogeneous nature of user check-in data remains a major obstacle to learning high-quality representations of users and other entities, especially for downstream applications such as point-of-interest (POI) recommendation. Most existing methods model user preferences from sequential POI tags without exploring the interactions between different modalities (e.g., user-user, user-timestamp, and user-POI interactions). To this end, we introduce a multimodal interaction-aware embedding framework that generates reliable entity embeddings on the heterogeneous socio-spatial network. At its core, multimodal interaction sub-graph sampling techniques are first designed to capture the heterogeneous contexts; a self-supervised contrastive learning technique is then leveraged to extract intra-modality and inter-modality interactions in a lightweight manner. We conduct experiments on the next-POI recommendation task using three real-world datasets. Experimental results demonstrate that our model outperforms state-of-the-art embedding learning algorithms.
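As a rough illustration of the contrastive step, the sketch below computes an InfoNCE-style loss for a single entity. It is a minimal sketch under stated assumptions, not the framework's actual objective: the abstract does not specify the loss, so the InfoNCE form, the cosine similarity, the `temperature` value, and the names `info_nce`, `normalize`, and `dot` are all hypothetical. The anchor and positive stand for embeddings of the same user obtained from two sampled sub-graphs (for instance a user-POI view and a user-timestamp view), and the negatives are embeddings of other entities.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v] if norm > 0 else list(v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor (assumed form, not from the source).

    anchor, positive: embeddings of the same entity from two sub-graph views.
    negatives: embeddings of other entities.
    """
    a = normalize(anchor)
    pos = math.exp(dot(a, normalize(positive)) / temperature)
    neg = sum(math.exp(dot(a, normalize(n)) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy 2-d embeddings: the two views of the same user agree in the first call
# and disagree in the second, so the first loss should be the smaller one.
aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
print(aligned < misaligned)
```

Minimizing a loss of this kind pulls together the views of the same entity and pushes apart different entities. Whether the positive pair comes from the same modality or from two different ones would determine whether intra-modality or inter-modality interactions are being captured.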