Depression ranks among the most prevalent mood-related psychiatric disorders. Existing clinical diagnostic approaches that rely on scale-based interviews are susceptible to individual and environmental variation. In contrast, the integration of neuroimaging techniques and computer science has provided compelling evidence for the quantitative assessment of major depressive disorder (MDD). However, a major challenge in computer-aided diagnosis of MDD is automatically and effectively mining complementary cross-modal information from limited datasets. In this study, we propose a few-shot learning framework that integrates multi-modal MRI data based on contrastive learning. The upstream task extracts transferable knowledge from heterogeneous source data; the downstream task then transfers this knowledge to the target dataset, where a hierarchical fusion paradigm integrates features both within and across modalities. Evaluated on a multi-modal clinical dataset, the proposed model achieved an average accuracy of 73.52% and an average AUC of 73.09%. Our findings further reveal that brain regions within the default mode network and the cerebellum play a crucial role in diagnosis, providing further direction for exploring reproducible biomarkers of MDD.
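To make the two-stage design concrete, the sketch below shows one plausible PyTorch realization: an InfoNCE-style contrastive objective for upstream cross-modal pretraining, followed by a simple hierarchical head that refines intra-modality features before inter-modality fusion and classification. All module names, layer sizes, the two-modality setup, and the specific loss form are illustrative assumptions, not the exact architecture used in this study.

```python
# Minimal sketch of the two-stage framework described above.
# All names, dimensions, and the loss form are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """Encodes one MRI modality (e.g., a vectorized connectivity matrix)."""
    def __init__(self, in_dim: int, emb_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, emb_dim),
        )

    def forward(self, x):
        return self.net(x)

def contrastive_loss(z1, z2, temperature: float = 0.1):
    """InfoNCE-style objective pulling paired cross-modal embeddings
    together (upstream pretraining on heterogeneous source data)."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature       # (B, B) similarity matrix
    targets = torch.arange(z1.size(0))       # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

class HierarchicalFusion(nn.Module):
    """Downstream head: refine features within each modality first,
    then fuse across modalities before the MDD-vs-control classifier."""
    def __init__(self, emb_dim: int = 128, n_modalities: int = 2):
        super().__init__()
        self.intra = nn.ModuleList(
            [nn.Linear(emb_dim, emb_dim) for _ in range(n_modalities)])
        self.inter = nn.Linear(n_modalities * emb_dim, emb_dim)
        self.classifier = nn.Linear(emb_dim, 2)

    def forward(self, feats):                 # feats: per-modality embeddings
        refined = [F.relu(layer(f)) for layer, f in zip(self.intra, feats)]
        fused = F.relu(self.inter(torch.cat(refined, dim=1)))
        return self.classifier(fused)

# Toy usage with random stand-ins for two modalities (hypothetical sizes).
enc_a, enc_b = ModalityEncoder(400), ModalityEncoder(300)
x_a, x_b = torch.randn(8, 400), torch.randn(8, 300)
pretrain_loss = contrastive_loss(enc_a(x_a), enc_b(x_b))      # upstream stage
logits = HierarchicalFusion()([enc_a(x_a), enc_b(x_b)])       # downstream stage
```

In this reading, the contrastive stage aligns representations across modalities so that the pretrained encoders can be transferred to the small target dataset, while the hierarchical head keeps intra-modality refinement separate from inter-modality fusion, mirroring the paradigm the abstract describes.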