Voice conversion (VC) refers to the transformation of the speaker identity of an utterance into that of a target speaker without altering the linguistic content. As recent VC techniques have made significant progress, deploying them in real-world scenarios is also being considered, where speech data inevitably contain interferences, the most common of which are background sounds. On the other hand, background sounds are informative and need to be retained in some applications, such as VC for movies/videos. To address these issues, we have proposed a noisy-to-noisy (N2N) VC framework that does not rely on clean training data and models noisy speech directly by using noise as a condition. Previous experimental results have proven its effectiveness. In this paper, we further improve its performance by introducing a pretrained noise-conditioned VC model. Moreover, to further explore the impact of introducing noise conditions, we evaluate the framework in a more realistic setting in which the training set exhibits speaker-dependent noise conditions. The experimental results demonstrate the effectiveness of the pretraining strategy but also reveal performance degradation under such strict noise conditions. We therefore propose a noise augmentation method to overcome this limitation, and further experiments confirm its effectiveness.