Semantic segmentation is important for indoor robotic tasks. However, relying solely on the RGB modality often yields poor results because the information it carries is limited. Introducing additional modalities can improve accuracy, but it also increases model complexity and computational cost, which makes such approaches unsuitable for real-time robotic applications. To balance accuracy and inference speed in indoor robotic scenarios, we propose an interactive efficient multitask RGB-D semantic segmentation network (IEMNet) that uses both the RGB and depth modalities. While preserving fast inference, we introduce a cross-modal feature rectification module that rectifies noise in the RGB and depth features and enables comprehensive cross-modal feature interaction. Furthermore, we propose a coordinate attention fusion module to fuse the features of the two modalities more effectively. Finally, an auxiliary instance segmentation task is added to the decoder to improve semantic segmentation performance. Experiments on two indoor scene datasets, NYUv2 and SUNRGB-D, demonstrate the superior performance of the proposed method. On NYUv2 in particular, it achieves 54.5% mIoU at 42 frames per second, striking an excellent balance between accuracy and inference speed.
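For readers unfamiliar with the reported metric, mean intersection over union (mIoU) averages the per-class ratio of true positives to the union of predicted and ground-truth pixels. The following is a minimal sketch of the standard definition in plain Python, not the evaluation code used in this work; the function name `mean_iou` and the `ignore_index` parameter are illustrative assumptions, and the sketch assumes labels are flattened into sequences of integer class IDs.

```python
def mean_iou(pred, target, num_classes, ignore_index=None):
    """Mean intersection over union for flat sequences of class labels.

    Hypothetical helper for illustration; assumes every prediction lies in
    [0, num_classes) and that `ignore_index` marks unlabeled pixels.
    """
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # unlabeled pixels do not count toward any class
        conf[t][p] += 1

    ious = []
    for c in range(num_classes):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp                                   # missed pixels of class c
        fp = sum(conf[r][c] for r in range(num_classes)) - tp    # pixels wrongly labeled c
        union = tp + fp + fn
        if union > 0:  # skip classes absent from both prediction and ground truth
            ious.append(tp / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three classes and six pixels.
target = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
```

In the toy example the per-class IoU values are 1/3, 2/3, and 1/2, so `score` is approximately 0.5. Benchmark protocols typically accumulate the confusion matrix over the whole test set before averaging, rather than averaging per-image scores.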