In this paper, a new self-supervised strategy for learning meaningful representations of complex optical Satellite Image Time Series (SITS) is presented. The proposed method, named U-BARN, a Unet-BERT spAtio-temporal Representation eNcoder, exploits irregularly sampled SITS. The architecture is designed to learn rich and discriminative features from unlabeled data, enhancing the synergy between the spatio-spectral and the temporal dimensions. To train on unlabeled data, a time series reconstruction pretext task, inspired by the BERT strategy but adapted to SITS, is proposed. A large-scale unlabeled Sentinel-2 dataset is used to pre-train U-BARN. During pre-training, U-BARN processes annual time series containing up to 100 dates. To demonstrate its feature learning capability, the SITS representations encoded by U-BARN are then fed into a shallow classifier to generate semantic segmentation maps. Experiments are conducted on a labeled crop dataset (PASTIS) as well as a dense land cover dataset (MultiSenGE). Two ways of exploiting the U-BARN pre-training are considered: its weights are either frozen or fine-tuned. The results show that the SITS representations given by the frozen U-BARN are more effective for land cover and crop classification than those of a linear layer trained in a supervised way. We then observe that fine-tuning boosts U-BARN performance on the MultiSenGE dataset. Additionally, on PASTIS, in scenarios with scarce reference data, fine-tuning brings a significant performance gain compared to fully supervised approaches. We also investigate the influence of the percentage of elements masked during pre-training on the quality of the SITS representation. Finally, semantic segmentation results show that the fully supervised U-BARN architecture outperforms the spatio-temporal baseline (U-TAE) on both downstream tasks: crop segmentation and dense land cover segmentation.
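
To make the pretext task concrete, the sketch below illustrates the general idea of a BERT-style masked reconstruction objective applied to a SITS tensor: a random fraction of acquisition dates is hidden and the network is trained to reconstruct them, with the loss evaluated on the masked dates only. The tensor shapes, the `encoder` callable, the zero-masking choice, and the `mask_ratio` value are illustrative assumptions for this sketch, not the authors' implementation.

```python
# Minimal sketch of a masked-reconstruction pretext task for SITS
# (hypothetical shapes and encoder; not the U-BARN implementation).
import torch
import torch.nn as nn

def masked_reconstruction_loss(encoder, sits, mask_ratio=0.3):
    """sits: (batch, dates, channels, height, width) annual time series.
    A random subset of dates is masked; the encoder must reconstruct them."""
    b, t, c, h, w = sits.shape
    # Boolean mask over the temporal dimension (True = masked date).
    mask = torch.rand(b, t, device=sits.device) < mask_ratio
    corrupted = sits.clone()
    corrupted[mask] = 0.0                # hide the selected acquisitions
    reconstruction = encoder(corrupted)  # same shape as the input series
    # Reconstruction error is computed on the masked dates only.
    return nn.functional.mse_loss(reconstruction[mask], sits[mask])
```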