Consider comparing two independent binomial responses. Our interest is whether the two binomial parameters differ and, if so, which is larger and by how much. This apparently simple problem was addressed by Fisher in the 1930s and has been the subject of many review papers since then. Yet new work on the issue continues to appear, and there is no consensus solution. Previous reviews have focused primarily either on testing and the properties of validity and power, or on confidence intervals and their coverage and expected length. Here we evaluate both together. For example, we consider whether a p-value and its matching confidence interval are compatible, meaning that the p-value leads to rejection at level α if and only if the 1 − α confidence interval excludes all null parameter values. For focus, we examine only non-asymptotic inferences, so that most of the p-values and confidence intervals are valid (i.e., exact) by construction. Within this focus, we review different methods, emphasizing many of the properties and interpretational aspects we desire from applied frequentist inference: validity, accuracy, good power, equivariance, compatibility, coherence, causal interpretation, and parameterization and direction of effect. We show that no single method can meet all of the desirable properties, and we give recommendations that depend on which properties are deemed most important.
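To make the compatibility property concrete, the following is a minimal sketch in Python. It is an illustration under stated assumptions, not necessarily one of the methods reviewed or recommended here: it uses the conditional exact approach on the odds ratio, takes the "central" p-value to be twice the smaller one-sided conditional p-value, and forms the interval by inverting the two one-sided tests at level α/2 each. All function names (`cond_pmf`, `central_pvalue`, `central_ci`, `is_compatible`) are hypothetical, and the bisection bracket of 1e-8 to 1e8 is an arbitrary choice that should be adequate only for small samples.

```python
from math import comb, exp, inf, log


def cond_pmf(n1, n2, t, psi):
    """pmf of X2 given X1 + X2 = t, where psi = odds(group 2) / odds(group 1)."""
    lo, hi = max(0, t - n1), min(n2, t)
    # Work on the log scale so that large psi**k terms do not overflow.
    logw = {k: log(comb(n2, k)) + log(comb(n1, t - k)) + k * log(psi)
            for k in range(lo, hi + 1)}
    top = max(logw.values())
    w = {k: exp(v - top) for k, v in logw.items()}
    total = sum(w.values())
    return {k: v / total for k, v in w.items()}


def tail_probs(x2, n1, n2, t, psi):
    """Return (P[X2 <= x2], P[X2 >= x2]) under the conditional distribution."""
    pmf = cond_pmf(n1, n2, t, psi)
    p_le = sum(p for k, p in pmf.items() if k <= x2)
    p_ge = sum(p for k, p in pmf.items() if k >= x2)
    return p_le, p_ge


def central_pvalue(x1, n1, x2, n2, psi0=1.0):
    """Two-sided p-value: twice the smaller one-sided p-value, capped at 1."""
    p_le, p_ge = tail_probs(x2, n1, n2, x1 + x2, psi0)
    return min(1.0, 2.0 * min(p_le, p_ge))


def _solve(f, target, increasing):
    """Bisection on log(psi) for f(psi) = target, with f monotone in psi."""
    a, b = log(1e-8), log(1e8)  # assumed bracket; adequate for small tables
    for _ in range(200):
        m = (a + b) / 2
        if (f(exp(m)) < target) == increasing:
            a = m  # the root lies to the right of m
        else:
            b = m  # the root lies to the left of m
    return exp((a + b) / 2)


def central_ci(x1, n1, x2, n2, alpha=0.05):
    """Interval for psi from inverting two one-sided tests at level alpha/2."""
    t = x1 + x2
    lo, hi = max(0, t - n1), min(n2, t)
    lower = 0.0 if x2 == lo else _solve(
        lambda psi: tail_probs(x2, n1, n2, t, psi)[1], alpha / 2, True)
    upper = inf if x2 == hi else _solve(
        lambda psi: tail_probs(x2, n1, n2, t, psi)[0], alpha / 2, False)
    return lower, upper


def is_compatible(x1, n1, x2, n2, alpha=0.05, psi0=1.0):
    """True if 'p <= alpha' agrees with 'interval excludes psi0'."""
    rejects = central_pvalue(x1, n1, x2, n2, psi0) <= alpha
    lower, upper = central_ci(x1, n1, x2, n2, alpha)
    excludes = not (lower < psi0 < upper)
    return rejects == excludes


if __name__ == "__main__":
    # Hypothetical data: 0 of 5 events in group 1, 7 of 7 in group 2.
    print(central_pvalue(0, 5, 7, 7))
    print(central_ci(0, 5, 7, 7))
    print(is_compatible(0, 5, 7, 7))
```

Because this interval is obtained by inverting the same pair of one-sided tests that define the p-value, the check should succeed for every table, apart from exact ties at p = α. Pairing a p-value with an interval derived from a different test is where compatibility can fail, which is why the property has to be checked rather than assumed.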