Forensic science plays a critical role in the American criminal justice system. For decades, many feature-based fields of forensic science, such as firearm and toolmark identification, developed outside the purview of the scientific community. Currently, black-box studies are used to assess the scientific validity of feature-based methods, and judges across the country rely heavily on their results. However, this reliance is misplaced. Black-box studies to date suffer from inappropriate sampling methods and high rates of missingness, yet current studies ignore both problems in arriving at the error rate estimates presented to courts. We explore the impact of each limitation using available data from black-box studies and court materials. We show that black-box studies rely on non-representative samples of examiners. Using a case study of a popular ballistics study, we find evidence that these non-representative samples may commit fewer errors than the wider population from which they are drawn. We also find evidence that the missingness in black-box studies is non-ignorable. Using data from a recent latent print study, we show that ignoring this missingness likely results in systematic underestimates of error rates. Finally, we offer concrete steps to overcome these limitations.
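To make the missingness point concrete, the sketch below simulates a hypothetical black-box study in which examiners are more likely to leave difficult comparisons unresolved, and difficult comparisons are also the ones most likely to be answered in error. All quantities here (number of comparisons, difficulty distribution, error and missingness probabilities) are illustrative assumptions, not values from any actual study or from our data; the sketch only illustrates why dropping non-ignorable missing responses pushes the naive error rate below the rate that would be observed if every comparison had to be resolved.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical study: 10,000 comparisons, each either answered correctly,
# answered in error, or left unresolved (treated as missing).
n = 10_000

# Assumed per-comparison difficulty; harder comparisons are rarer.
difficulty = rng.beta(2, 5, size=n)

# A comparison would be answered in error with probability rising in difficulty.
would_err = rng.random(n) < 0.15 * difficulty

# Non-ignorable missingness: harder (more error-prone) comparisons are
# more likely to be skipped or reported as unresolved.
p_missing = 0.05 + 0.6 * difficulty
is_missing = rng.random(n) < p_missing

# Error rate if every comparison had to be resolved.
true_error_rate = would_err.mean()

# Naive error rate computed by dropping missing responses and dividing
# errors by the resolved comparisons only.
resolved = ~is_missing
naive_error_rate = would_err[resolved].mean()

print(f"error rate with no missingness: {true_error_rate:.3%}")
print(f"naive error rate (missing dropped): {naive_error_rate:.3%}")  # systematically lower
```

Because the resolved comparisons are, on average, the easier ones, the naive estimate understates the error rate under these assumptions; if missingness were instead unrelated to difficulty (ignorable), the two printed rates would agree up to sampling noise.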