In medical imaging, an enormous variety of algorithms has been proposed for reconstructing a cross section of the human body. In assessing the relative task-oriented performance of reconstruction algorithms, it is desirable to assign statistical significance to claims that one algorithm is superior to another. However, achieving statistical significance very often demands a large number of observations. Performing such an evaluation on mathematical phantoms requires a means of running the competing algorithms on projection data obtained from a large number of randomly generated phantoms. Thereafter, various numerical measures of agreement between the reconstructed images and the original phantoms may be used to reach a conclusion with statistical substance. In this article we describe the software SuperSNARK, which automates an evaluation methodology for assigning statistical significance to observed differences in the performance of two or more image reconstruction algorithms. As a demonstration, we compare the relative efficacy of the maximum likelihood expectation maximization (ML-EM) algorithm and the filtered backprojection (FBP) method for three medical tasks in positron emission tomography (PET): estimating total uptake by structures, detecting relatively higher uptake between pairs of symmetric structures, and estimating uptake at individual points within structures. We find that for estimating total uptake ML-EM outperforms FBP, that for detecting relatively higher uptake there is no statistically significant difference between the two methods, and that for estimating pointwise uptake FBP outperforms ML-EM. SuperSNARK thus makes it easy to apply statistical hypothesis testing to substantiate claims of task-specific superiority of one reconstruction algorithm over another. © 1996 John Wiley & Sons, Inc.
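To make the evaluation methodology concrete, the following is a minimal sketch of the kind of paired-comparison loop the abstract describes: generate many random phantoms, run two competing reconstruction algorithms on the same noisy projection data, score each reconstruction with a task-specific figure of merit, and test the paired differences for statistical significance. This is not SuperSNARK itself; the phantom-generation, forward-projection, and reconstruction routines passed in as arguments are hypothetical placeholders, and only the NumPy/SciPy statistical machinery is real.

```python
"""Sketch of a paired statistical evaluation of two reconstruction
algorithms over randomly generated phantoms (illustrative only)."""
import numpy as np
from scipy import stats


def evaluate_pair(generate_phantom, forward_project, recon_a, recon_b,
                  figure_of_merit, n_phantoms=30, seed=0):
    """Score two reconstruction algorithms on the same projection data
    from many random phantoms and test the paired differences.

    All callables are assumed, user-supplied routines:
      generate_phantom(rng)        -> random mathematical phantom
      forward_project(phantom, rng)-> noisy projection (sinogram) data
      recon_a / recon_b(sinogram)  -> reconstructed image
      figure_of_merit(image, phantom) -> task-specific scalar score
    """
    rng = np.random.default_rng(seed)
    fom_a, fom_b = [], []
    for _ in range(n_phantoms):
        phantom = generate_phantom(rng)
        sinogram = forward_project(phantom, rng)
        # Both algorithms see identical data, so the comparison is paired.
        fom_a.append(figure_of_merit(recon_a(sinogram), phantom))
        fom_b.append(figure_of_merit(recon_b(sinogram), phantom))
    # Two-sided paired t-test on the per-phantom scores; the observed
    # difference is declared significant only if p falls below a chosen
    # threshold (e.g., 0.05).
    _, p_value = stats.ttest_rel(fom_a, fom_b)
    return float(np.mean(fom_a)), float(np.mean(fom_b)), float(p_value)
```

For the total-uptake task mentioned in the abstract, `figure_of_merit` could, for example, return the relative error between the summed activity of a structure in the reconstruction and in the original phantom; for the detection task it would instead score whether the member of each symmetric pair with higher true uptake also has higher reconstructed uptake. The pairing is the essential design choice: because both algorithms are scored on identical projection data, phantom-to-phantom variability cancels out of the test statistic, so significance is reached with far fewer phantoms than an unpaired comparison would require.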