White-box test generator tools rely only on the code under test to select test inputs, and they capture the implementation's output as assertions. If there is a fault in the implementation, it can get encoded in the generated tests. Tool evaluations usually measure fault-detection capability by counting such fault-encoding tests. However, these faults are only detected if the developer can recognize that the encoded behavior is faulty. We designed an exploratory study to investigate how developers perform in classifying generated white-box tests as faulty or correct. We carried out the study in a laboratory setting with 54 graduate students. The tests were generated for two open-source projects with the help of the IntelliTest tool. The performance of the participants was analyzed using binary classification metrics and by coding their observed activities. The results showed that participants incorrectly classified a large number of both fault-encoding and correct tests (with median misclassification rates of 33% and 25%, respectively). Thus the real fault-detection capability of test generators could be much lower than typically reported, and we suggest taking this human factor into account when evaluating generated white-box tests.
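To make the reported metric concrete, the following is a minimal sketch of how a per-participant misclassification rate could be computed from binary classification counts. This is an illustrative example only, not the study's analysis code; the function name and the example counts are hypothetical.

```python
def misclassification_rate(false_positives: int, false_negatives: int,
                           true_positives: int, true_negatives: int) -> float:
    """Fraction of tests a participant labeled incorrectly."""
    total = false_positives + false_negatives + true_positives + true_negatives
    return (false_positives + false_negatives) / total

# Hypothetical example: a participant judges 12 generated tests
# and mislabels 3 of them (2 correct tests marked faulty, 1 fault-encoding
# test marked correct).
rate = misclassification_rate(false_positives=2, false_negatives=1,
                              true_positives=5, true_negatives=4)
print(f"misclassification rate: {rate:.0%}")  # -> 25%
```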