We propose a new method for human speech intelligibility evaluation based on keyword spotting. In this method, participants play a stimulus and select the word they hear from a close set of alternatives. To find which sentence to use, the target word, and alternatives we mine a large set of stimuli using a phonetic dictionary and a language model. Unlike other tests, our method does not rely on specially designed sentences and can be used to evaluate in-the-wild material such as TED talks. We focus on audio-visual (AV) speech enhancement (SE) evaluation as a study case. We compared our method to a transcription task and observed that the two produce highly correlated results, albeit our task requiring substantially less participation time. We then adopted it on a large-scale evaluation of AVSE systems. Results show that keyword spotting is a suitable and efficient alternative to assess intelligibility from AV stimuli.