Drug discovery is stochastic. The effectiveness of candidate compounds in satisfying design objectives is unknown ahead of time, and the tools used for prioritization (predictive models and assays) are inaccurate and noisy. In a typical discovery campaign, thousands of compounds may be synthesized and tested before design objectives are achieved, with many others ideated but deprioritized. These challenges are well-documented, but assessing potential remedies has been difficult. We introduce DrugGym, a framework for modeling the stochastic process of drug discovery. Emulating biochemical assays with realistic surrogate models, we simulate the progression from weak hits to sub-micromolar leads with viable ADME. We use this testbed to examine how different ideation, scoring, and decision-making strategies impact statistical measures of utility, such as the probability of program success within predefined budgets and the expected costs to achieve target candidate profile (TCP) goals. We also assess the influence of affinity model inaccuracy, chemical creativity, batch size, and multi-step reasoning. Our findings suggest that reducing affinity model inaccuracy from 2 to 0.5 pIC50 units improves budget-constrained success rates tenfold. DrugGym represents a realistic testbed for machine learning methods applied to the hit-to-lead phase. Source code is available at www.drug-gym.org.
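To make the notion of a budget-constrained success rate concrete, the following is a minimal toy sketch and not the DrugGym implementation or its API. It assumes a hypothetical setup in which true potencies are drawn from a normal distribution, the affinity model is the true value plus Gaussian error, each round assays the top-scoring batch, and a campaign succeeds if any assayed compound meets the potency target before the budget runs out. All names and parameter values (`simulate_campaign`, `success_rate`, pool size, target) are illustrative assumptions.

```python
import random


def simulate_campaign(noise_sd, rng, budget=100, batch_size=10,
                      pool_size=100, target=8.0):
    """Return True if a compound with true pIC50 >= target is assayed within budget.

    Toy assumptions (not taken from DrugGym): each round ideates a fresh pool
    whose true pIC50 values are N(5, 1), and the scoring model reports the
    true value plus N(0, noise_sd) error.
    """
    tested = 0
    while tested < budget:
        # Hypothetical ideation step: true potencies of newly proposed compounds.
        true_vals = [rng.gauss(5.0, 1.0) for _ in range(pool_size)]
        # Noisy affinity model: pair each prediction with its hidden true value.
        scored = [(t + rng.gauss(0.0, noise_sd), t) for t in true_vals]
        # Greedy decision rule: assay the compounds with the highest predictions.
        scored.sort(reverse=True)
        batch = scored[:min(batch_size, budget - tested)]
        tested += len(batch)
        # The assay is treated as exact here; success means a true hit was tested.
        if any(t >= target for _, t in batch):
            return True
    return False


def success_rate(noise_sd, n_trials=400, seed=0):
    """Monte Carlo estimate of the probability of success within the budget."""
    rng = random.Random(seed)
    wins = sum(simulate_campaign(noise_sd, rng) for _ in range(n_trials))
    return wins / n_trials


if __name__ == "__main__":
    for sd in (2.0, 1.0, 0.5):
        print(f"model error {sd} pIC50 units: success rate {success_rate(sd):.2f}")
```

In this toy setting, a lower model error should yield a higher estimated success rate, but the magnitudes depend entirely on the assumed distributions and are not comparable to the results reported for DrugGym.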