Multimodal sensors capture and integrate diverse characteristics of a scene to maximize information gain. In optics, this may involve capturing intensity in specific spectra or polarization states to determine factors such as material properties or an individual’s health condition. Combining multimodal camera data with shape data from 3D sensors is challenging. Multimodal cameras, e.g., hyperspectral cameras, and cameras operating outside the visible light spectrum, e.g., thermal cameras, fall well short of state-of-the-art photo cameras in resolution and image quality. In this article, a new method is demonstrated that superimposes multimodal image data onto a 3D model created by multi-view photogrammetry. While a high-resolution photo camera captures a set of images from varying view angles to reconstruct a detailed 3D model of the scene, one or more low-resolution multimodal cameras simultaneously record the scene. All cameras are pre-calibrated and rigidly mounted on a rig, i.e., their imaging properties and relative positions are known. The method was realized in a laboratory setup consisting of a professional photo camera, a thermal camera, and a 12-channel multispectral camera. In our experiments, the multimodal superimposition achieved a data fusion accuracy better than one pixel. Finally, application examples of multimodal 3D digitization are demonstrated, and further steps toward system realization are discussed.
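The role of the known rig geometry can be illustrated with a minimal sketch. It is not the implementation used in this work. It assumes an ideal pinhole model without lens distortion, ignores occlusion, and uses nearest-neighbor sampling of a single-channel image. All function and parameter names (`project_to_multimodal`, `sample_values`, `R_rel`, `t_rel`, `K_mm`) are hypothetical. Each photo camera pose estimated by photogrammetry is chained with the fixed photo-to-multimodal transform, so that vertices of the 3D model can be projected into the simultaneously recorded multimodal image.

```python
import numpy as np


def project_to_multimodal(points_world, R_photo, t_photo, R_rel, t_rel, K_mm):
    """Project world-frame 3D points (N x 3) into the multimodal image.

    Assumed convention: x_cam = R @ x + t for both transforms.
    R_photo, t_photo: world -> photo camera (pose from photogrammetry).
    R_rel, t_rel:     photo camera -> multimodal camera (rig calibration).
    K_mm:             3 x 3 pinhole intrinsics of the multimodal camera.
    """
    # World frame -> photo camera frame.
    pts_photo = points_world @ R_photo.T + t_photo
    # Photo camera frame -> multimodal camera frame (fixed by the rig).
    pts_mm = pts_photo @ R_rel.T + t_rel
    depth = pts_mm[:, 2]
    # Ideal pinhole projection; lens distortion is ignored in this sketch.
    homog = pts_mm @ K_mm.T
    with np.errstate(divide="ignore", invalid="ignore"):
        pixels = homog[:, :2] / homog[:, 2:3]
    return pixels, depth


def sample_values(image, pixels, depth):
    """Nearest-neighbor lookup of a single-channel image at projected pixels.

    Points behind the camera or outside the image are returned as NaN.
    Occlusion is not checked here.
    """
    h, w = image.shape[:2]
    cols = np.rint(pixels[:, 0])
    rows = np.rint(pixels[:, 1])
    valid = (depth > 0) & (cols >= 0) & (cols < w) & (rows >= 0) & (rows < h)
    values = np.full(len(pixels), np.nan)
    values[valid] = image[rows[valid].astype(int), cols[valid].astype(int)]
    return values, valid
```

A complete pipeline would additionally need a visibility test, so that surface points hidden from the multimodal camera do not receive values, and a distortion model matching the calibration of each camera.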