Self-supervised learning (SSL) has emerged as a powerful method for extracting meaningful representations from vast, unlabeled datasets and has already transformed computer vision and natural language processing. Similarly, in single-cell genomics (SCG), representation learning is well recognized for offering insights into complex biological data, a recognition amplified by the advent of early foundation model approaches. However, despite these advancements, identifying scenarios in SCG where SSL outperforms traditional supervised or unsupervised learning methods remains a nuanced challenge. Furthermore, selecting the most effective pretext tasks within the SSL framework for SCG is a critical yet unresolved question. Here, we address this gap by adapting and benchmarking SSL techniques in SCG, including masked autoencoders with multiple masking strategies and contrastive learning approaches. With models trained on over 20 million cells, this study rigorously examines multiple downstream tasks, including cell type prediction, gene expression reconstruction, cross-modality prediction, and data integration. Our empirical analyses underscore the nuanced role of SSL, particularly in transfer learning scenarios that leverage auxiliary data or analyze novel datasets. Masked autoencoders outperform contrastive methods in SCG, diverging from trends in computer vision. Moreover, our findings reveal notable capabilities of SSL in zero-shot cell type prediction and offer insights into its potential benefits for cross-modality prediction and data integration. In summary, we study the application of SSL in SCG, minimizing model bias through simple, fully connected networks, and benchmark SSL's utility across key representation learning scenarios.
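
To make the masked-autoencoder pretext task concrete, the following minimal PyTorch sketch applies random masking to gene expression vectors with a fully connected encoder-decoder, in line with the simple architectures described above. The layer widths, 50% masking rate, and masked mean-squared-error objective are illustrative assumptions, not the study's exact configuration.

```python
import torch
import torch.nn as nn


class MaskedAutoencoder(nn.Module):
    """Fully connected masked autoencoder for gene expression vectors.

    Illustrative sketch only: sizes, masking rate, and loss are assumptions.
    """

    def __init__(self, n_genes: int, latent_dim: int = 64, mask_rate: float = 0.5):
        super().__init__()
        self.mask_rate = mask_rate
        self.encoder = nn.Sequential(
            nn.Linear(n_genes, 512), nn.ReLU(),
            nn.Linear(512, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, n_genes),
        )

    def forward(self, x: torch.Tensor):
        # Random-masking strategy: zero out a fraction of gene values per cell.
        mask = torch.rand_like(x) < self.mask_rate
        x_masked = x.masked_fill(mask, 0.0)
        recon = self.decoder(self.encoder(x_masked))
        # Reconstruction loss evaluated only on the masked positions.
        loss = ((recon - x)[mask] ** 2).mean()
        return loss, recon


# Hypothetical usage on a batch of log-normalized expression values.
model = MaskedAutoencoder(n_genes=2000)
batch = torch.rand(32, 2000)
loss, _ = model(batch)
loss.backward()
```

Other masking strategies (e.g., masking highly variable genes or gene programs) would only change how the boolean mask is constructed; the encoder, decoder, and masked reconstruction loss stay the same.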