This paper investigates the Visual Voice Activity Detection (V-VAD) problem in unconstrained environments. A novel method for V-VAD in the wild is proposed, exploiting local shape and motion information appearing at spatiotemporal locations of interest for facial video description and the Bag of Words (BoW) model for facial video representation. Facial video classification is subsequently performed using state-of-the-art classification algorithms. Experimental results on one publicly available V-VAD data set demonstrate the effectiveness of the proposed method, which achieves better generalization performance on unseen users than recently proposed state-of-the-art methods. Additional results on a new, unconstrained data set provide evidence that the proposed method remains effective even in cases where other existing methods fail.
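To make the pipeline concrete, the following is a minimal sketch of a generic BoW video-classification pipeline of the kind the abstract describes, not the authors' implementation: the `extract_descriptors` function, the 72-dimensional descriptor size, the codebook size, and the use of scikit-learn's KMeans and SVC are all illustrative assumptions, with random placeholders standing in for the local shape/motion descriptors.

```python
# Minimal sketch (not the paper's implementation) of a Bag-of-Words
# pipeline for facial video classification. Random placeholders stand in
# for local shape/motion descriptors (e.g., HOG/HOF-like features)
# extracted at spatiotemporal interest points.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def extract_descriptors(video):
    # Placeholder: a real system would detect spatiotemporal interest
    # points in the facial video and compute local shape/motion
    # descriptors around them. Descriptor size (72-D) is assumed.
    n_points = rng.integers(50, 200)
    return rng.standard_normal((n_points, 72))

def bow_histogram(descriptors, codebook):
    # Assign each descriptor to its nearest codeword and build a
    # normalized occurrence histogram: the BoW video representation.
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()

# Toy "videos": speaking (label 1) vs. non-speaking (label 0).
videos = [object() for _ in range(40)]
labels = np.array([i % 2 for i in range(40)])

# 1) Learn a codebook by clustering descriptors pooled over training videos.
all_desc = np.vstack([extract_descriptors(v) for v in videos])
codebook = KMeans(n_clusters=64, n_init=10, random_state=0).fit(all_desc)

# 2) Represent each video as a BoW histogram over the codebook.
X = np.array([bow_histogram(extract_descriptors(v), codebook) for v in videos])

# 3) Train a standard classifier (an SVM here) on the histograms.
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.predict(X[:4]))
```

The key design point the sketch illustrates is that the codebook is learned once from descriptors pooled across training videos, so every video, regardless of how many interest points it yields, is mapped to a fixed-length histogram that standard classifiers can consume.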