This paper proposes parallel dictionary learning for multimodal voice conversion (VC). Because visual features are robust to noise, multimodal features have attracted attention in the field of speech processing, and we have previously proposed multimodal VC using Non-negative Matrix Factorization (NMF). Experimental results showed that our conventional multimodal VC converts speech effectively in a noisy environment; however, the difference in conversion quality between audio-input VC and multimodal VC is not large in a clean environment. We assume this is because our exemplar dictionary is over-complete. Moreover, because of the non-negativity constraint on visual features, our conventional multimodal NMF-based VC cannot factorize visual features effectively. To enhance the conversion quality of our NMF-based multimodal VC, we propose parallel dictionary learning. The non-negativity constraint on visual features is removed so that visual features that include negative values can be handled. Experimental results showed that the proposed method effectively converts multimodal features in a clean environment.
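
To make the idea of conversion with a parallel dictionary concrete, the following is a minimal sketch and not the implementation evaluated in this paper. It assumes a source dictionary and a target dictionary whose columns are aligned, estimates non-negative activations of the input over the source dictionary, and applies the same activations to the target dictionary. To allow dictionary entries and input features with negative values (as with the visual features), the sketch swaps in a semi-NMF-style multiplicative update for the activations; whether this matches the update rule used here is an assumption. The names `estimate_activations`, `convert`, `W_src`, and `W_tgt` are hypothetical.

```python
import numpy as np


def estimate_activations(X, W, n_iter=200, eps=1e-12, seed=0):
    """Estimate H >= 0 such that X is approximately W @ H.

    W and X may contain negative values; only H is kept non-negative.
    This is a semi-NMF-style multiplicative update (an assumption, not
    necessarily the update rule of the paper).
    """
    rng = np.random.default_rng(seed)
    # Strictly positive initialization so multiplicative updates can move H.
    H = rng.random((W.shape[1], X.shape[1])) + 0.1

    WtX = W.T @ X
    WtW = W.T @ W
    # Split each matrix into its positive and negative parts (both >= 0).
    WtX_pos, WtX_neg = (np.abs(WtX) + WtX) / 2, (np.abs(WtX) - WtX) / 2
    WtW_pos, WtW_neg = (np.abs(WtW) + WtW) / 2, (np.abs(WtW) - WtW) / 2

    for _ in range(n_iter):
        num = WtX_pos + WtW_neg @ H
        den = WtX_neg + WtW_pos @ H
        # eps keeps the ratio finite and H strictly positive.
        H *= np.sqrt((num + eps) / (den + eps))
    return H


def convert(X_src, W_src, W_tgt, **kwargs):
    """Convert source features by reusing their activations on W_tgt.

    The columns of W_src and W_tgt are assumed to be aligned (parallel).
    Returns the converted features and the estimated activations.
    """
    H = estimate_activations(X_src, W_src, **kwargs)
    return W_tgt @ H, H
```

In a multimodal setting, one option is to stack the audio and visual features of each frame into a single column of `X_src` and to stack the corresponding dictionary rows in `W_src`, so that a single activation matrix is shared by both modalities. Any weighting between the modalities is left out of this sketch.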