Owing to the widespread deployment of face and speaker recognition systems, attacks on neural-network-based biometric systems, which treat face or voice recognition as classification problems with low-dimensional output vectors, have drawn increasing research attention. Recently, cross-modal voice-to-face (VTF) systems have learned to generate faces from voices by matching several biometric characteristics of the generated faces to those of the speakers. However, attacks targeting VTF systems, whose outputs are high-dimensional face images, have not yet been studied. In this paper, we introduce adversarial attack methods for VTF systems under different attack conditions. By adding subtle perturbations to the original voice, these methods generate a fake face that is either close to a target face or far from the original face. Under the white-box setting, we formulate a multiobjective optimization that drives the generated face toward the target while improving the imperceptibility of the adversarial sample, and we further propose a stepwise iterative optimization strategy to achieve faster and more effective attacks; comparative experiments against various methods are then presented. Under the black-box setting, adversarial samples generated from surrogate models produce fake faces that are far from the original ones. Qualitative and quantitative experimental results show a high target-face-matching rate, low similarity to the original face, and the imperceptibility of the adversarial audio. This study provides useful insights for privacy protection and for improving the robustness of generation systems in information security.
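
The white-box formulation described above can be illustrated with a minimal sketch: a perturbation on the input voice is optimized under a joint loss that pulls the generated face toward a target image while an imperceptibility term keeps the perturbation small. The function name `vtf_adversarial_attack`, the model handle `vtf_model`, and the hyperparameters (`alpha`, `lam`, `steps`, `eps`) are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def vtf_adversarial_attack(vtf_model, voice, target_face,
                           alpha=1e-3, lam=0.1, steps=200, eps=0.005):
    """Sketch of a white-box multiobjective attack on a voice-to-face generator.
    All names and hyperparameters here are hypothetical placeholders."""
    delta = torch.zeros_like(voice, requires_grad=True)
    optimizer = torch.optim.Adam([delta], lr=alpha)

    for _ in range(steps):
        fake_face = vtf_model(voice + delta)              # generate a face from the perturbed voice
        face_loss = torch.nn.functional.mse_loss(fake_face, target_face)
        percept_loss = delta.norm(p=2)                    # penalize audible perturbation energy
        loss = face_loss + lam * percept_loss             # weighted multiobjective trade-off

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        with torch.no_grad():                             # keep the perturbation within a small budget
            delta.clamp_(-eps, eps)

    return (voice + delta).detach()
```

For the untargeted (black-box-style) variant mentioned in the abstract, the same loop would maximize the distance between the generated face and the original face, with gradients taken through a surrogate model rather than the deployed system.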