Purpose: To propose good practices for using the structural similarity metric (SSIM) and reporting its value. SSIM is one of the most popular image quality metrics used in the medical image synthesis community because of its alleged superiority over voxel-by-voxel measurements like the average error or the peak signal-to-noise ratio (PSNR). It has seen massive adoption since its introduction, but its limitations are often overlooked. Notably, SSIM is designed to work on a strictly positive intensity scale, which is generally not the case in medical imaging. Common intensity scales such as Hounsfield units (HU) contain negative numbers, and negative values can also be introduced by image normalization techniques such as z-normalization. Methods: We created a series of experiments to quantify the impact of negative values in the SSIM computation. Specifically, we trained a three-dimensional (3D) U-Net to synthesize T2-weighted MRI from T1-weighted MRI using the BRATS 2018 dataset. SSIM was computed on the synthetic images with a shifted dynamic range. Next, to evaluate the suitability of SSIM as a loss function on images with negative values, it was used as a loss function to synthesize z-normalized images. Finally, the difference between two-dimensional (2D) SSIM and 3D SSIM was investigated using multiple 2D U-Nets trained on different planes of the images.
Results: The impact of the misuse of SSIM was quantified; it was established that it introduces a large downward bias in the computed SSIM. It also introduces a small random error that can change the relative ranking of models. The exact values of this bias and error depend on the quality and the intensity histogram of the synthetic images. Although small, the reported error is significant considering the small SSIM difference between state-of-the-art models. It was also shown that SSIM cannot be used as a loss function when images contain negative values, because of major errors in the gradient calculation that result in under-performing models. 2D SSIM was also found to be overestimated in 2D image synthesis models when computed along the plane of synthesis, due to the discontinuities between slices that are typical of 2D synthesis methods. Conclusion: Various types of misuse of SSIM were identified, and their impact was quantified. Based on the findings, this paper proposes good practices when using SSIM, such as reporting the average over the volume of the image containing tissue and appropriately defining the dynamic range.
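The sensitivity of SSIM to negative intensities can be illustrated with a minimal sketch. It is not the implementation used in the experiments. It assumes a simplified single-window SSIM computed over a whole intensity sequence rather than the usual sliding window, the commonly used constants k1 = 0.01 and k2 = 0.03, and a dynamic range taken as the intensity span of the reference. The function name `global_ssim` and the toy intensity values are hypothetical. The sketch compares SSIM on z-normalized-like values with SSIM on the same values after a common shift that makes them non-negative.

```python
def global_ssim(x, y, data_range, k1=0.01, k2=0.03):
    """Simplified SSIM using one window that covers the whole sequence.

    Assumption: real SSIM averages this quantity over local windows;
    a single global window is used here only to keep the sketch short.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(x)
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # The luminance term depends on the means, so it is not shift-invariant.
    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    # The contrast/structure term depends only on (co)variances.
    structure = (2 * cov + c2) / (var_x + var_y + c2)
    return luminance * structure


# Hypothetical z-normalized-like intensities (contain negative values).
reference = [-1.0, -0.5, 0.0, 0.5, 1.0]
synthetic = [-0.8, -0.6, 0.1, 0.4, 1.1]
data_range = max(reference) - min(reference)

# SSIM computed directly on values that include negatives.
raw = global_ssim(reference, synthetic, data_range)

# Same images after a common shift so that all intensities are >= 0.
offset = -min(min(reference), min(synthetic))
shifted = global_ssim(
    [v + offset for v in reference],
    [v + offset for v in synthetic],
    data_range,
)

print(f"SSIM on raw values:     {raw:.3f}")
print(f"SSIM on shifted values: {shifted:.3f}")
```

In this toy case the two sequences are close, yet the unshifted value is much lower because the means lie near zero and the luminance term is dominated by the stabilizing constant. Only the luminance term changes under the shift; the structure term is identical in both calls.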