Annotating accelerometer-based physical activity data remains a challenging task, limiting the creation of robust supervised machine learning models due to the scarcity of large, labeled, free-living human activity recognition (HAR) datasets. Researchers are exploring self-supervised learning (SSL) as an alternative to relying solely on labeled data approaches. However, there has been limited exploration of the impact of large-scale, unlabeled datasets for SSL pre-training on downstream HAR performance, particularly utilizing more than one accelerometer. To address this gap, a transformer encoder network is pre-trained on various amounts of unlabeled, dual-accelerometer data from the HUNT4 dataset: 10, 100, 1k, 10k, and 100k hours. The objective is to reconstruct masked segments of signal spectrograms. This pre-trained model, termed SelfPAB, serves as a feature extractor for downstream supervised HAR training across five datasets (HARTH, HAR70+, PAMAP2, Opportunity, and RealWorld). SelfPAB outperforms purely supervised baselines and other SSL methods, demonstrating notable enhancements, especially for activities with limited training data. Results show that more pre-training data improves downstream HAR performance, with the 100k-hour model exhibiting the highest performance. It surpasses purely supervised baselines by absolute F1-score improvements of 7.1% (HARTH), 14% (HAR70+), and an average of 11.26% across the PAMAP2, Opportunity, and RealWorld datasets. Compared to related SSL methods, SelfPAB displays absolute F1-score enhancements of 10.4% (HARTH), 18.8% (HAR70+), and 16% (average across PAMAP2, Opportunity, RealWorld).