A key component of experimentation in IR is statistical hypothesis testing, which researchers and developers use to make inferences about the effectiveness of their system relative to others. A statistical hypothesis test can tell us the likelihood that a small mean difference in effectiveness (on the order of 5%, say) is due to randomness or measurement error, and thus is critical for making progress in research. But the tests typically used in IR, such as the t-test and the Wilcoxon signed-rank test, are very general and were not developed specifically for the problems we face in information retrieval evaluation. A better approach would take advantage of the fact that the atomic unit of measurement in IR is the relevance judgment rather than the effectiveness measure, and would develop tests that model relevance directly. In this work we present such an approach, showing theoretically that modeling relevance in this way naturally gives rise to the effectiveness measures we care about. We demonstrate the usefulness of our model on both simulated data and a diverse set of runs from various TREC tracks.