As intelligent voice assistants become more widespread and the scope of their listening increases, they become attractive targets for attackers. In the future, a malicious actor could train voice assistants to listen to audio outside their purview, creating a threat to users' privacy and security. How can this misbehavior be detected? Due to the ambiguities of natural language, people may need to work in conjunction with algorithms to determine whether a given conversation should be heard. To investigate how accurately humans can perform this task, we developed a framework for people to conduct "Test Drives" of always-listening services: after submitting sample conversations, users receive instant feedback about whether these would have been captured. Leveraging a Wizard of Oz interface, we conducted a study with 200 participants to determine whether they could detect one of four types of attacks on three different services. We studied the behavior of individuals, as well as groups working collaboratively, and investigated the effects of task framing on performance. We found that individuals successfully detected malicious apps at rates ranging from 7.5% to 75%, depending on the type of attack, and that groups were highly successful when considered collectively. Our results suggest that the Test Drive framework can be an effective tool for studying user behaviors and concerns, as well as a potentially welcome addition to voice assistant app stores, where it could decrease privacy concerns surrounding always-listening services.