Detecting the anchor's voice in live musical streams is an important preprocessing step for music and speech signal processing. Existing approaches to voice activity detection (VAD) rely primarily on audio; however, audio-only VAD struggles to focus on the target voice in noisy environments. This paper proposes a rule-embedded network that fuses audio-visual (A-V) inputs to better detect the target voice. The rule coordinates the relation between the two modalities and uses visual representations as a mask to filter out non-target sound. Experiments show that: 1) with cross-modal fusion under the proposed rule, the A-V branch outperforms the audio branch within the same model framework; 2) the bimodal A-V model far outperforms audio-only models, indicating that incorporating both audio and visual signals is highly beneficial for VAD. To attract more attention to cross-modal music and audio signal processing, a new live musical video corpus with frame-level labels is introduced.
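
The masking idea can be illustrated with a minimal sketch. The exact form of the rule is not given here, so the element-wise product below is an assumption, as are the function names `fuse_with_visual_mask` and `detect`, the 0.5 threshold, and the per-frame scores, which are illustrative values rather than experimental data.

```python
from typing import List


def fuse_with_visual_mask(audio_scores: List[float],
                          visual_scores: List[float]) -> List[float]:
    """Gate per-frame audio voice scores with per-frame visual scores.

    Hypothetical rule (an assumption, not the paper's stated formula):
    an element-wise product, so a frame scores high only when both
    modalities indicate that the target is vocalizing.
    """
    if len(audio_scores) != len(visual_scores):
        raise ValueError("audio and visual streams must be frame-aligned")
    return [a * v for a, v in zip(audio_scores, visual_scores)]


def detect(scores: List[float], threshold: float = 0.5) -> List[int]:
    """Turn per-frame scores into binary frame-level labels."""
    return [1 if s >= threshold else 0 for s in scores]


# Illustrative values only, not data from the paper.
audio = [0.9, 0.8, 0.9, 0.1]   # third frame: voice-like sound, not the anchor
visual = [0.9, 0.9, 0.1, 0.1]  # third frame: anchor is visibly not vocalizing

audio_only = detect(audio)
audio_visual = detect(fuse_with_visual_mask(audio, visual))
```

In this toy case the audio-only labels mark the third frame as voice, while the masked A-V labels suppress it because the visual score is low. A learned network would produce these scores from features rather than take them as given.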