Singing voice separation on robots faces the problem of interpreting ambiguous auditory signals. The acoustic signal that a humanoid robot perceives through its onboard microphones is a mixture of singing voice, music, and noise, subject to distortion, attenuation, and reverberation. In this paper, we used a three-directional Inception-ResNet structure within a U-shaped encoder-decoder network to make better use of the spatial and spectral information in the spectrograms. The model was trained with multiple objectives: a magnitude consistency loss, a phase consistency loss, and a magnitude correlation consistency loss. We recorded singing voices and accompaniments derived from the MIR-1K dataset with NAO robots and synthesized a 10-channel dataset for training the model. The experimental results show that the model trained with the multiple objectives reaches an average normalized signal-to-distortion ratio (NSDR) of 11.55 dB on the test set, outperforming the comparison model.
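The abstract does not give the exact form of the three objectives, so the following is a minimal NumPy sketch of one plausible combination: an L1 term on spectrogram magnitudes, a wrapped cosine term on phases, and a Pearson-correlation term on magnitudes. The function names and the weights `w_mag`, `w_phase`, and `w_corr` are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def magnitude_loss(est_mag, ref_mag):
    # Mean absolute error between estimated and reference magnitude spectrograms.
    return np.mean(np.abs(est_mag - ref_mag))

def phase_loss(est_phase, ref_phase):
    # 1 - cos(delta) penalizes phase deviation while respecting 2*pi wrapping.
    return np.mean(1.0 - np.cos(est_phase - ref_phase))

def correlation_loss(est_mag, ref_mag):
    # Penalize 1 - Pearson correlation so estimated magnitudes track the
    # reference magnitudes up to scale and offset.
    e = est_mag.ravel() - est_mag.mean()
    r = ref_mag.ravel() - ref_mag.mean()
    denom = np.linalg.norm(e) * np.linalg.norm(r) + 1e-8
    return 1.0 - float(e @ r) / denom

def multi_objective_loss(est_mag, est_phase, ref_mag, ref_phase,
                         w_mag=1.0, w_phase=0.5, w_corr=0.5):
    # Weighted sum of the three objectives; the weights are hypothetical.
    return (w_mag * magnitude_loss(est_mag, ref_mag)
            + w_phase * phase_loss(est_phase, ref_phase)
            + w_corr * correlation_loss(est_mag, ref_mag))
```

In an actual training loop, the spectrogram tensors would come from an STFT of the separated and reference signals, and the weighted sum would be backpropagated through the encoder-decoder network.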