Outdoor positioning has become a ubiquitous technology, enabling a proliferation of location-based services such as automotive navigation and asset tracking. Indoor positioning, by contrast, is an emerging technology with many potential applications. Researchers are continuously working to improve its accuracy, and one general approach is to use machine learning to fuse input data from multiple available sources, such as camera imagery. For this active research area, we conduct a systematic literature review and identify around 40 relevant research papers. We analyze contributions describing indoor positioning methods based on multimodal data, i.e., combinations of images with motion sensors, radio interfaces, and LiDARs. The survey allows us to draw conclusions about open research areas and to outline the potential future evolution of multimodal indoor positioning.