Accurate modeling of protein-ligand interactions (PLIs) is critical for drug discovery. Despite recent advances, most existing PLI modeling methods rely on single-modal data, which restricts their effectiveness and applicability. In this study, we introduce Uni-Clip, a contrastive learning framework that integrates multiple modalities: structure and residue features of proteins, along with conformation and graph features of ligands. Optimized with a specifically designed CF-InfoNCE loss, Uni-Clip learns comprehensive representations of PLIs. In benchmark evaluations on the widely used LIT-PCBA and DUD-E datasets, Uni-Clip outperforms baselines, improving the enrichment factor at 1% by 147% and 218%. Furthermore, Uni-Clip serves as a practical tool for drug discovery applications, as demonstrated by virtual screening against GPX4, a flat and challenging protein target, where it identified potent inhibitors with an IC50 of 4.17 μM, and by target fishing for benzbromarone, which highlights the potential for repurposing benzbromarone in cancer therapy.
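For readers unfamiliar with the benchmark metric, the enrichment factor at 1% compares the fraction of actives among the top-scored 1% of a screening library with the fraction of actives in the whole library. The sketch below illustrates that standard definition only; it is not the evaluation code used for Uni-Clip. The function name `enrichment_factor`, its parameters, and the rounding rule for the size of the top fraction are assumptions made for illustration, and published benchmarks may handle ties and rounding differently.

```python
from typing import Sequence


def enrichment_factor(
    scores: Sequence[float],
    labels: Sequence[int],
    fraction: float = 0.01,
) -> float:
    """Enrichment factor at a given top fraction (hypothetical helper).

    scores: predicted interaction scores, higher means more likely active.
    labels: 1 for a known active, 0 for a decoy or inactive.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    n_total = len(scores)
    n_actives = sum(labels)
    if n_total == 0 or n_actives == 0:
        raise ValueError("need at least one compound and one active")

    # Size of the top fraction; rounding to the nearest integer is an
    # assumption here, and at least one compound is always selected.
    n_top = max(1, int(round(n_total * fraction)))

    # Rank compounds by descending score and count actives in the top slice.
    ranked = sorted(range(n_total), key=lambda i: scores[i], reverse=True)
    hits_top = sum(labels[i] for i in ranked[:n_top])

    # Ratio of the active rate in the top slice to the overall active rate.
    return (hits_top / n_top) / (n_actives / n_total)
```

Under this definition a random ranking gives an expected value of about 1, and the maximum attainable value at 1% is 100, reached when every compound in the top 1% is an active.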