The development of deep learning has greatly improved the performance of automatic speech recognition (ASR), which now demonstrates ability comparable to human hearing on many tasks. Voice interfaces are increasingly used as the input to applications and smart devices. However, existing research has shown that DNNs are easily misled by slight perturbations into producing incorrect transcriptions, which is extremely dangerous for voice-controlled intelligent applications. Research on adversarial examples is currently concentrated mainly in the image domain, where targeted adversarial attacks can be achieved with only slight modifications to samples, both in the physical world and against black-box models. Because of the high dimensionality of audio data and the complexity of ASR systems, adversarial attacks in the speech domain are far harder to mount, in both the digital and the physical world. Existing black-box adversarial attack methods require frequent queries to the target model to obtain evaluation scores and then adjust the adversarial samples to achieve a targeted attack. Audio samples that can carry out black-box attacks usually lack concealment, so the victim can easily perceive the content of the injected command. Highly concealed adversarial attacks, which constrain the perturbation within a tiny range, only succeed against white-box models. We propose a non-contact black-box adversarial attack algorithm with high transferability, which achieves an 81.57% success rate against a commercial speech-recognition API. In addition, we search for the masking music best suited to each adversarial sample based on a psychoacoustic model to improve its concealment; the disguised samples still achieve a 69.27% attack success rate. We verify the effectiveness of the attack in both the digital and the physical world: an attacker needs only an ordinary loudspeaker or mobile phone as the playback device to mount a physical adversarial attack. The adversarial examples with masking music can attack voice applications and smart voice devices in real-world scenarios.
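
To make the masking-music idea concrete, the sketch below is a minimal, illustrative simplification, not the psychoacoustic model used in the paper: it scores each candidate music clip by how much of the perturbation's spectral energy it covers in the time-frequency plane, and picks the clip with the highest coverage. All function names, parameters, and the placeholder signals are hypothetical and assume only NumPy and SciPy.

```python
import numpy as np
from scipy.signal import stft

def masking_score(perturbation, music, sr=16000, n_fft=512):
    """Crude proxy for psychoacoustic masking: the fraction of
    time-frequency bins where the music's energy exceeds the
    perturbation's energy (higher = better concealment)."""
    # Align lengths by truncating to the shorter signal.
    n = min(len(perturbation), len(music))
    _, _, P = stft(perturbation[:n], fs=sr, nperseg=n_fft)
    _, _, M = stft(music[:n], fs=sr, nperseg=n_fft)
    p_db = 20 * np.log10(np.abs(P) + 1e-10)
    m_db = 20 * np.log10(np.abs(M) + 1e-10)
    return float(np.mean(m_db >= p_db))

def pick_masking_music(perturbation, candidates, sr=16000):
    """Return the candidate clip that best covers the perturbation,
    together with its coverage score."""
    scores = [masking_score(perturbation, c, sr) for c in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores[best]

if __name__ == "__main__":
    sr = 16000
    rng = np.random.default_rng(0)
    # Stand-ins for a 1-second adversarial perturbation and three music clips.
    perturbation = 0.01 * rng.standard_normal(sr)
    candidates = [0.1 * rng.standard_normal(sr) for _ in range(3)]
    clip, score = pick_masking_music(perturbation, candidates, sr)
    print(f"best clip masks {score:.1%} of perturbation bins")
```

A real psychoacoustic model would instead compute frequency-dependent masking thresholds (e.g., following MPEG-style models) rather than a simple per-bin energy comparison; this sketch only conveys the selection loop.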