This paper proposes a video scene segmentation framework referred to as Contrasting Multi-Modal Similarity (CMS). A video is composed of multiple scenes, which are short stories or semantic units of the video, and each scene consists of multiple shots. The task of video scene segmentation aims to semantically segment long videos, such as movies, into a sequence of scenes by identifying the boundaries of each scene transition. Existing video scene segmentation frameworks have relied primarily on visual cues and have focused on two major approaches: 1) comparing the visual cues of adjacent shots to identify scene boundaries and 2) clustering shots based on visual cues to distinguish among scenes. However, videos contain numerous scenes that are difficult to distinguish using visual information alone, as they often appear similar or ambiguous. Motivated by this issue, we propose CMS, a framework that leverages not only visual cues (i.e., shots) but also textual cues (i.e., captions) to semantically distinguish scenes. CMS leverages visual and textual cues as follows: (1) generate a caption for each shot using a zero-shot captioning model (Caption Generation); (2) construct similarity score matrices for each modality to measure semantic similarities (Similarity Score Calculation); and (3) based on these matrices, select similar and dissimilar shots for contrastive training (Similarity Score-based Sampling). Our experiments show that CMS surpasses previous state-of-the-art methods with a relatively simple approach that does not require complex model architectures.
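
To make steps (2) and (3) concrete, the following is a minimal illustrative sketch, not the authors' implementation: it assumes precomputed per-shot visual embeddings and caption (text) embeddings, uses cosine similarity for each modality's score matrix, fuses the two matrices with a hypothetical weight alpha, and then selects, for each anchor shot, the most similar shot as a positive and the least similar shot as a negative for contrastive training. All function names and parameters are assumptions for illustration.

```python
import numpy as np

def cosine_similarity_matrix(features: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between per-shot feature vectors (N x D)."""
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    return normed @ normed.T

def sample_contrastive_pairs(visual_feats: np.ndarray,
                             text_feats: np.ndarray,
                             alpha: float = 0.5):
    """Similarity Score Calculation + Similarity Score-based Sampling (sketch).

    visual_feats: per-shot visual embeddings, shape (N, D_v).
    text_feats:   per-shot caption embeddings, shape (N, D_t).
    alpha:        hypothetical weight for fusing the two modality matrices.
    Returns indices of a positive (similar) and a negative (dissimilar)
    shot for every anchor shot.
    """
    sim_v = cosine_similarity_matrix(visual_feats)   # visual-modality score matrix
    sim_t = cosine_similarity_matrix(text_feats)     # text-modality score matrix
    sim = alpha * sim_v + (1.0 - alpha) * sim_t      # fused similarity scores

    sim_no_self = sim.copy()
    np.fill_diagonal(sim_no_self, -np.inf)           # anchor cannot be its own positive
    positives = sim_no_self.argmax(axis=1)           # most similar shot per anchor
    np.fill_diagonal(sim_no_self, np.inf)            # anchor cannot be its own negative
    negatives = sim_no_self.argmin(axis=1)           # least similar shot per anchor
    return positives, negatives

# Toy usage: 8 shots with 512-d visual and 384-d caption embeddings (random here).
rng = np.random.default_rng(0)
pos, neg = sample_contrastive_pairs(rng.normal(size=(8, 512)),
                                    rng.normal(size=(8, 384)))
print(pos, neg)
```

In a real pipeline, the selected positive and negative shots would feed a standard contrastive objective over shot representations; the random features and the simple argmax/argmin selection above stand in only to show the shape of the sampling step.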