Abstract-In image-guided robotic surgery, labeling and segmenting the endoscopic video stream into meaningful parts provides important contextual information that surgeons can exploit to enhance their perception of the surgical scene. This information can provide surgeons with real-time decision-making guidance before initiating critical tasks such as tissue cutting. Segmenting endoscopic video is a very challenging problem due to a variety of complications including significant noise and clutter attributed to bleeding and smoke from cutting, poor color and texture contrast between different tissue types, occluding surgical tools, and limited (surface) visibility of the objects' geometries on the projected camera views. In this paper, we propose a multi-modal approach to segmentation where preoperative 3D computed tomography scans and intraoperative stereo-endoscopic video data are jointly analyzed. The idea is to segment multiple poorly visible structures in the stereo/multichannel endoscopic videos by fusing reliable prior knowledge captured from the preoperative 3D scans. More specifically, we estimate and track the pose of the preoperative models in 3D and consider the models' non-rigid deformations to match with corresponding visual cues in multi-channel endoscopic video and segment the objects of interest. Further, contrary to most augmented reality frameworks in endoscopic surgery that assume known camera parameters, an assumption that is often violated during surgery due to non-optimal camera calibration and changes in camera focus/zoom, our method embeds these parameters into the optimization hence correcting the calibration parameters within the segmentation process. We evaluate our technique in several scenarios: synthetic data, ex vivo lamb kidney datasets, and in vivo clinical partial nephrectomy surgery with results demonstrating high accuracy and robustness.