Water is indispensable for life and central to ecosystems, human activities, and climate dynamics, so it requires rapid and accurate monitoring. Such monitoring is vital for sustaining ecosystems, enhancing human welfare, and effectively managing land, water, and biodiversity at both local and global levels. In the rapidly evolving domain of remote sensing and deep learning, this study focuses on water body extraction and classification using visual foundation models (VFMs), a recent class of deep learning models. Specifically, the Segment Anything Model (SAM) and Contrastive Language-Image Pre-training (CLIP) models have shown promise in semantic segmentation, dataset creation, change detection, and instance segmentation tasks. A novel two-step approach is proposed in which images are first segmented with the Automatic Mask Generator method of SAM and the resulting segments are then classified zero-shot with CLIP; its effectiveness is tested on water body extraction problems. The proposed methodology was applied to both satellite imagery acquired by Landsat 8 OLI and very high-resolution aerial imagery. Results revealed that the proposed methodology accurately delineated water bodies across complex environmental conditions, achieving a mean intersection over union (IoU) of 94.41% and an F1 score of 96.97% for satellite imagery. Similarly, for the aerial imagery dataset, it achieved a mean IoU of 90.83% and an F1 score exceeding 94.56%. The high accuracy achieved in selecting segments predominantly classified as water highlights the effectiveness of the proposed approach in intricate environmental image analysis.
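To make the segment-selection step and the reported metrics concrete, the following is a minimal sketch rather than the study's implementation. It assumes that segmentation has already produced a list of segments (here, sets of pixel coordinates) and that a zero-shot classifier such as CLIP has assigned each segment a probability of being water. The function names `merge_water_segments` and `iou_and_f1`, the 0.5 threshold, and the toy data are all hypothetical. IoU and F1 follow their standard definitions, IoU = TP / (TP + FP + FN) and F1 = 2TP / (2TP + FP + FN).

```python
from typing import List, Set, Tuple

Pixel = Tuple[int, int]  # (row, col) coordinate of one image pixel


def merge_water_segments(
    segments: List[Set[Pixel]],
    water_probs: List[float],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the study
) -> Set[Pixel]:
    """Return the union of all segments whose water probability reaches the threshold."""
    if len(segments) != len(water_probs):
        raise ValueError("one probability is required per segment")
    water: Set[Pixel] = set()
    for pixels, prob in zip(segments, water_probs):
        if prob >= threshold:
            water |= pixels
    return water


def iou_and_f1(predicted: Set[Pixel], reference: Set[Pixel]) -> Tuple[float, float]:
    """Compute IoU and F1 for the water class from two pixel sets."""
    tp = len(predicted & reference)   # water in both masks
    fp = len(predicted - reference)   # predicted water, reference non-water
    fn = len(reference - predicted)   # reference water that was missed
    if tp + fp + fn == 0:
        return 1.0, 1.0  # both masks empty: treated here as perfect agreement
    iou = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return iou, f1


# Toy data (made up for illustration): three segments on a 3x3 image.
segments = [
    {(0, 0), (0, 1), (1, 0)},
    {(1, 1)},
    {(2, 0), (2, 1), (2, 2)},
]
water_probs = [0.92, 0.40, 0.81]  # stand-ins for classifier outputs
reference = {(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)}

predicted = merge_water_segments(segments, water_probs)
iou, f1 = iou_and_f1(predicted, reference)
print(f"IoU={iou:.3f}, F1={f1:.3f}")
```

In this toy case the first and third segments are kept, giving five true-positive pixels, one false positive, and one false negative, so IoU is 5/7 and F1 is 5/6. A mean IoU over a dataset would average the per-image values.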