Object segmentation in cluttered environments is a fundamental pre-processing step for many perception tasks, such as vision-based robotic grasping. Most existing object segmentation methods cannot precisely segment unknown objects, particularly in scenes with significant occlusion. In this paper, we propose a novel approach for refining the segmentation of unknown objects in cluttered scenes. Specifically, we design a ConvMixer-based UNet model (CM-UNet) that enhances the segmentation masks and boundaries of unknown objects. The model uses a ConvMixer-based Cross Fusion (CMCF) module to leverage each object's semantic and localization information, both of which are essential for successful segmentation. Furthermore, we use patch embedding as a pre-processing step, in which the input data is rearranged to speed up processing and improve the efficiency of the system. CM-UNet was trained and extensively tested on several challenging, publicly available datasets that include unknown objects in unstructured scenes. Thorough evaluations of segmentation accuracy and processing efficiency against state-of-the-art solutions showed that our model outperforms them. CM-UNet efficiently improves the segmentation accuracy of unknown objects in cluttered scenes, even in the presence of occlusion.
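To make the patch-embedding step more concrete, the following is a minimal NumPy sketch of how input data can be rearranged into non-overlapping patches and then linearly projected. It is an illustration under stated assumptions, not the CM-UNet implementation: the function names `patchify` and `patch_embed`, the 4-channel 64×64 input, the patch size of 4, the embedding width of 32, and the random projection weights are all hypothetical choices that do not come from the paper.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Rearrange an (H, W, C) array into an (H/p, W/p, p*p*C) grid of flattened patches."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("height and width must be divisible by the patch size")
    # Split both spatial axes into (number of patches, patch size).
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    # Bring the two patch-index axes to the front: (H/p, W/p, p, p, C).
    grid = grid.transpose(0, 2, 1, 3, 4)
    # Flatten every patch into one vector.
    return grid.reshape(h // patch, w // patch, patch * patch * c)


def patch_embed(image: np.ndarray, patch: int, weight: np.ndarray) -> np.ndarray:
    """Linearly project each flattened patch to an embedding vector."""
    return patchify(image, patch) @ weight


# Hypothetical example values (assumptions, not taken from the paper).
rng = np.random.default_rng(0)
image = rng.random((64, 64, 4))            # assumed 4-channel input
patch, dim = 4, 32                         # assumed patch size and embedding width
weight = rng.standard_normal((patch * patch * 4, dim))

tokens = patch_embed(image, patch, weight)
print(tokens.shape)  # (16, 16, 32)
```

In this sketch, the rearrangement reduces the spatial grid from 64×64 to 16×16, so later layers operate on 16 times fewer spatial positions. That reduction is one plausible source of the efficiency gain described above. A strided convolution whose kernel size and stride both equal the patch size computes the same kind of projection.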