Rail transit is becoming a major mode of rapid urban and intercity passenger and freight transportation, and its safe operation is of great significance for safeguarding people's lives and property and for maintaining social stability. The current scheme of manual hazard monitoring in rail transit still leaves potential safety risks. Accurate rail scene understanding is an essential step towards a smart train. Owing to the closed nature of railway scenes, little research has been conducted on the perception and understanding of rail transit. In view of the above, we propose the multimodal remote sensing image (MRSI) data set, the first multimodal proximity remote sensing data set for rail scene understanding. MRSI consists of 27k images collected from freight rail and metro lines, with pixel-level and bounding-box annotations that were labeled and checked manually. We used a variety of sensing devices mounted on locomotives to record track scenes, including straight, curved, and forked sections, under different lighting and weather conditions: daytime, dusk, and nighttime, as well as rainy days. We also include an additional infrared thermometer in the metro environment and propose a new image registration method applied after synchronous acquisition, thereby constructing MRSI so that it combines spatial and radiometric properties. With this data set, the track area can be segmented and obstacles can be recognized by sensing the environment in front of the train, which together lead to rail scene understanding. MRSI is publicly available at https://zenodo.org/record/5732905#.YaPIpsdBwdU.