Rigorous performance engineering traditionally assumes measuring on bare-metal environments to control for as many confounding factors as possible. Unfortunately, some researchers and practitioners lack the access, expertise, or funds to operate dedicated performance-testing hardware, making public clouds an attractive alternative. However, cloud environments are inherently unpredictable and variable with respect to their performance. In this study, we explore the effects of cloud environments on the variability of performance-testing outcomes, and to what extent regressions can still be reliably detected. We focus on software microbenchmarks as an example of performance tests and execute extensive experiments on three different cloud services (AWS, GCE, and Azure) across different instance types. We also compare the results to a hosted bare-metal offering from IBM Bluemix. In total, we gathered more than 5 million unique microbenchmarking data points from benchmarks written in Java and Go. We find that the variability of results differs substantially between benchmarks and instance types (from 0.03% to over 100% relative standard deviation). We also observe that hypothesis testing with the Wilcoxon rank-sum test generally leads to unsatisfactory results for detecting regressions, due to a very high number of false positives in all tested configurations. However, simply testing for a difference in medians detects even small regressions with good success. In some cases, a shift in median execution time as small as 1% can be detected with a low false-positive rate, given a large sample of 20 instances.
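
To make the two detection strategies mentioned above concrete, the sketch below (not the paper's actual analysis pipeline) compares execution-time samples from a "control" and a "treatment" version of a benchmark: it quantifies variability as relative standard deviation, applies the Wilcoxon rank-sum test, and checks for a shift in medians. The sample arrays, the synthetic data generation, and the 1% threshold are all hypothetical illustrations, not values taken from the study.

```python
# Minimal sketch (assumptions labeled): quantify variability and flag a
# regression between two hypothetical benchmark samples.
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(42)
control = rng.normal(loc=100.0, scale=5.0, size=200)    # ns/op, hypothetical
treatment = rng.normal(loc=101.0, scale=5.0, size=200)  # ~1% slower, hypothetical

def rsd(samples):
    """Relative standard deviation (coefficient of variation) in percent."""
    return 100.0 * np.std(samples, ddof=1) / np.mean(samples)

print(f"RSD control:   {rsd(control):.2f}%")
print(f"RSD treatment: {rsd(treatment):.2f}%")

# (a) Wilcoxon rank-sum test: flag a regression if p < alpha (0.05 here).
stat, p_value = ranksums(control, treatment)
print(f"rank-sum p-value: {p_value:.4f} -> "
      f"{'regression flagged' if p_value < 0.05 else 'no regression flagged'}")

# (b) Median comparison: flag a regression if the median execution time
# increases by more than a chosen threshold (1%, an assumed value).
median_shift = (np.median(treatment) - np.median(control)) / np.median(control)
print(f"median shift: {median_shift * 100:.2f}% -> "
      f"{'regression flagged' if median_shift > 0.01 else 'no regression flagged'}")
```

In the study itself, such samples would come from repeated benchmark executions on many cloud instances rather than from synthetic distributions.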