Abstract. Extreme sea level events, such as storm surges, pose a threat to coastlines around the globe. Many tide gauges have been measuring the sea level and recording these extreme events for decades, some for over a century. The data from these gauges often serve as the basis for evaluating the extreme sea level statistics, which are used to extrapolate sea levels that serve as design values for coastal protection. Hydrodynamic models often have difficulty correctly reproducing extreme sea levels and, consequently, extreme sea level statistics and trends. In this study, we generate a 13-member hindcast ensemble for the non-tidal Baltic Sea from 1979 to 2018 using the coastal ocean model GETM (General Estuarine Transport Model). To cope with mean biases in maximum water levels in the simulations, we include simulations both with and without wind-speed adjustments in the ensemble. We evaluate the uncertainties in the extreme value statistics and recent trends of annual maximum sea levels. Although the ensemble mean shows good agreement with observations regarding return levels and trends, we still find large variability and uncertainty within the ensemble (95 % confidence levels up to 60 cm for the 30-year return level). We argue that biases and uncertainties in the atmospheric reanalyses, e.g. variability in the representation of storms, translate directly into uncertainty within the ensemble. The variability of the 99th percentile wind speeds, translated into sea level elevation, is of the same order as the variability of the ensemble spread of the modelled maximum sea levels. Our results emphasise that 13 members are insufficient and that regionally large ensembles should be created to minimise uncertainties. This should improve the ability of the models to correctly reproduce the underlying extreme value statistics and thus provide robust estimates of future climate-change-induced changes.
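To illustrate how a 30-year return level and its spread across ensemble members can be obtained from annual maximum sea levels, the following is a minimal sketch. It assumes a Gumbel distribution fitted by the method of moments, which is a simplification chosen for brevity and not necessarily the extreme value model used in this study. The function names, the synthetic data, and the parameter values (location 1.0 m, scale 0.2 m) are hypothetical.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def gumbel_fit_moments(annual_maxima):
    """Estimate Gumbel (location, scale) by the method of moments.

    Assumption: a Gumbel fit stands in for whatever extreme value
    model a real analysis would use.
    """
    mean = statistics.fmean(annual_maxima)
    std = statistics.stdev(annual_maxima)  # sample standard deviation
    scale = std * math.sqrt(6.0) / math.pi
    loc = mean - EULER_GAMMA * scale
    return loc, scale


def return_level(loc, scale, return_period_years):
    """Level exceeded on average once per return period (Gumbel quantile)."""
    p_exceed = 1.0 / return_period_years
    return loc - scale * math.log(-math.log(1.0 - p_exceed))


def ensemble_return_levels(members, return_period_years=30.0):
    """Fit each ensemble member separately and return its return level."""
    return [
        return_level(*gumbel_fit_moments(member), return_period_years)
        for member in members
    ]


# Hypothetical synthetic data: 13 members with 40 annual maxima each (metres).
# If E ~ Exp(1), then -ln(E) follows a standard Gumbel distribution.
rng = random.Random(42)
members = [
    [1.0 - 0.2 * math.log(rng.expovariate(1.0)) for _ in range(40)]
    for _ in range(13)
]

levels = ensemble_return_levels(members, return_period_years=30.0)
print(f"ensemble mean 30-year return level: {statistics.fmean(levels):.2f} m")
print(f"ensemble spread (max - min): {max(levels) - min(levels):.2f} m")
```

Even though every synthetic member is drawn from the same distribution, the fitted 30-year return levels differ between members, which illustrates why a small ensemble leaves substantial sampling uncertainty in the estimated return levels.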