Peikes, Moreno and Orzol (2008) sensibly caution researchers that propensity score analysis may not lead to valid causal inference in field applications. At the same time, however, they made the far stronger claim that they had performed an ideal test of whether propensity score matching in quasi-experimental data can approximate the results of a randomized experiment in their dataset, and that this test showed that such matching could not do so. In this article we show that their study does not support that conclusion because it failed to meet a number of basic criteria for an ideal test. By implication, many other purported tests of the effectiveness of propensity score analysis probably also fail to meet these criteria, and are therefore questionable contributions to the literature on that effectiveness.
Keywords: Propensity Scores, Strong Ignorability, Quasi-Experiments, Within-Study Comparison
In 2008, Peikes, Moreno and Orzol (henceforth PMO) published a case study in The American Statistician that cautioned social program evaluators that propensity score analysis may yield quite different results than those from a randomized experiment. We join them in that caution because we doubt that many applications of propensity scores meet some of the most basic conditions for their valid use (Shadish, 2012; Steiner, Cook & Shadish, 2011; Shadish, Cook, Steiner & Clark, 2010). In that sense, we applaud the PMO article. Field researchers need to appreciate how difficult it can be to use propensity score analysis in a way that yields confidence in the results.
At the same time, however, PMO made a second claim: that they performed an "ideal" (pp. 222, 223, 230) test of whether propensity score matching in quasi-experimental data could approximate the results of a randomized experiment. In fact, a careful analysis of PMO suggests just the opposite conclusion: they implemented neither an ideal propensity score analysis nor an ideal comparison of results from a propensity score analysis to results from a randomized experiment. In this article, we show why the PMO study was not ideal in either respect. Just as practitioners need to appreciate how difficult it can be to use propensity