Depth information has proven beneficial in contemporary RGB-D salient object detection, yet current unsupervised video object segmentation (UVOS) methods rely heavily on spatiotemporal cues and largely ignore it. This overreliance on motion cues degrades performance when optical flow is unreliable. We introduce DIMF-Net, a depth-informed cross-modal three-stream network for UVOS, which, to our knowledge, is among the first to exploit depth information for this task. Specifically, we extend the conventional two-stream design with a gate fusion module that integrates depth features. Furthermore, because data from different sensing modalities are often complementary yet individually noisy, we use features from one modality to filter and calibrate noisy information in another. Together, these designs substantially improve UVOS accuracy while maintaining high speed in real-world scenarios. DIMF-Net achieves state-of-the-art performance on the DAVIS, FBMS, and YouTube-Objects benchmarks.
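The gate fusion idea mentioned above can be illustrated with a minimal NumPy sketch. This is a hypothetical simplification, not the paper's actual module: a sigmoid gate computed from the appearance features decides, per channel, how much of the depth features to admit before a residual sum. The names `gate_fusion`, `w_gate`, and `b_gate` are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate_fusion(appearance, depth, w_gate, b_gate):
    """Toy gated fusion: a sigmoid gate derived from the appearance
    features controls how much depth information is admitted."""
    gate = sigmoid(appearance @ w_gate + b_gate)  # gate values lie in (0, 1)
    return appearance + gate * depth              # gated residual fusion

rng = np.random.default_rng(0)
C = 8                                          # toy channel dimension
appearance = rng.standard_normal((4, C))       # toy appearance features
depth = rng.standard_normal((4, C))            # toy depth features
w_gate = rng.standard_normal((C, C)) * 0.1     # toy gate weights
b_gate = np.zeros(C)

fused = gate_fusion(appearance, depth, w_gate, b_gate)
print(fused.shape)
```

In this sketch the gate suppresses depth channels wherever the appearance stream gives them a low score, which mirrors the abstract's idea of using one modality to filter and calibrate noisy measurements from another.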