Self-supervised learning techniques have achieved immense success in natural language processing (NLP) by enabling models to learn from broad language data at unprecedented scales. Here, we aim to leverage the success of these techniques for mental state decoding, where researchers seek to identify specific mental states (such as an individual's experience of anger or happiness) from brain activity. To this end, we devise a set of novel self-supervised learning frameworks for neuroimaging data based on prominent learning frameworks in NLP. At their core, these frameworks learn the dynamics of brain activity by modeling sequences of activity, much as NLP models learn from sequences of text. We evaluate the proposed frameworks by pre-training models on a broad neuroimaging dataset comprising functional Magnetic Resonance Imaging (fMRI) data from 11,980 experimental runs of 1,726 individuals across 34 datasets, and subsequently adapting the pre-trained models to two benchmark mental state decoding datasets. We show that the pre-trained models transfer well: they outperform baseline models when adapted to the data of only a few individuals, and models pre-trained in a learning framework based on causal language modeling clearly outperform the others.

Preprint. Under review.
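To make the core idea concrete, the following is a minimal, self-contained sketch of the causal ("next-step") sequence-modeling objective applied to brain-activity-like time series. All shapes, the synthetic data, and the single linear predictor are illustrative assumptions; the paper's actual architectures, parcellation, and training setup are not specified here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for an fMRI run: T time points x P brain parcels.
T, P = 200, 16
A = 0.5 * np.eye(P)                      # simple ground-truth dynamics
X = np.zeros((T, P))
for t in range(1, T):
    X[t] = X[t - 1] @ A + rng.standard_normal(P)

# Causal-language-modeling analogue: predict activity at time t+1
# from activity at time t, here with a single linear map W (P x P).
inputs, targets = X[:-1], X[1:]
W = np.zeros((P, P))
lr = 1e-2

for _ in range(500):
    pred = inputs @ W                    # predictions use only past frames
    grad = inputs.T @ (pred - targets) / len(inputs)
    W -= lr * grad                       # gradient step on mean squared error

mse = np.mean((inputs @ W - targets) ** 2)
baseline = np.mean(targets ** 2)         # error of always predicting zero
```

A transformer trained with a causal attention mask optimizes the same kind of next-step objective over these sequences, just with a far more expressive predictor than the linear map used in this sketch.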