Abstract. Reconstructions of past variations in global mean surface temperature (GMST) are used to characterise the Earth system response to perturbations and to validate Earth system simulations. Reconstructing GMST beyond the instrumental period relies on algorithms that aggregate local proxy temperature records. Here, we propose to establish standards for evaluating the performance of such reconstruction algorithms. Our framework relies on pseudo-proxy experiments: we test the ability of an algorithm to reconstruct a simulated GMST from artificially generated proxy data created from the same simulation. We apply the framework to an adapted version of the GMST reconstruction algorithm used in Snyder (2016) and to the synthesis of marine proxy records for temperature of the last 130 kyr from Jonkers et al. (2020). For the pseudo-proxy experiments, we use an ensemble of four transient simulations covering either the last glacial cycle or the last 25 kyr. We find that the algorithm can reconstruct variability on timescales longer than 4 kyr over the last 25 kyr. However, beyond 40 kyr BP, age uncertainty limits the algorithm's capability to timescales longer than 15 kyr. The main sources of uncertainty are the factor that rescales near-global mean sea surface temperature to GMST, the proxy measurement, the specific set of record locations, and potential seasonal bias. Increasing the number of records significantly reduces all sources of uncertainty except the scaling factor. We also show that a trade-off exists between including a large number of records, which reduces the uncertainty on long timescales, and including only records with low age uncertainty, high accumulation rate, and high resolution, which improves the reconstruction of short timescales. Finally, the method and the quantitative results presented here can serve as a basis for future evaluations of reconstructions.
We also suggest future avenues to improve reconstruction algorithms and discuss the key limitations arising from the proxy data properties.
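The pseudo-proxy principle summarised above can be illustrated with a deliberately simplified sketch. It is not the algorithm of Snyder (2016) nor the framework evaluated here: the synthetic temperature field, the white measurement noise, the unweighted-mean reconstruction, and all names and parameter values (`pseudo_proxy_experiment`, `noise_sd`, `n_cells`, and so on) are assumptions made for illustration only. Age uncertainty, seasonal bias, and the sea surface temperature to GMST scaling are omitted.

```python
import numpy as np


def pseudo_proxy_experiment(n_records, noise_sd=1.0, n_time=500, n_cells=200, seed=0):
    """Toy pseudo-proxy test: RMSE between a known mean and its reconstruction."""
    rng = np.random.default_rng(seed)

    # Stand-in for a simulation: a shared slow signal plus local variability
    # in each grid cell (purely synthetic, hypothetical values).
    t = np.arange(n_time)
    global_signal = 3.0 * np.sin(2.0 * np.pi * t / n_time)
    field = global_signal[:, None] + rng.normal(0.0, 0.5, size=(n_time, n_cells))

    # Target of the reconstruction: the mean over all cells of the toy field.
    true_mean = field.mean(axis=1)

    # Pseudo-proxies: sample a subset of cells and add measurement noise.
    sites = rng.choice(n_cells, size=n_records, replace=False)
    proxies = field[:, sites] + rng.normal(0.0, noise_sd, size=(n_time, n_records))

    # Stand-in reconstruction algorithm: unweighted mean over the records.
    reconstruction = proxies.mean(axis=1)

    # Because the target is known exactly, the error can be measured directly.
    return float(np.sqrt(np.mean((reconstruction - true_mean) ** 2)))


rmse_few = pseudo_proxy_experiment(n_records=5)
rmse_many = pseudo_proxy_experiment(n_records=100)
print(f"RMSE with 5 records: {rmse_few:.3f}; with 100 records: {rmse_many:.3f}")
```

In this toy setting the error shrinks as records are added, because both the measurement noise and the site-sampling error average out. A systematic scaling error would not average out in the same way, which is why it is left out of this sketch rather than approximated.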