Cross-device federated learning (FL) protects user privacy by collaboratively training a model on user devices, thereby eliminating the need for collecting, storing, and manually labeling user data. Previous works have considered cross-device FL for automatic speech recognition (ASR); however, a few important challenges have not been fully addressed. These include the lack of ground-truth ASR transcriptions and the scarcity of compute resources and network bandwidth on edge devices. In this paper, we address these two challenges. First, we propose a federated learning system to support on-device ASR adaptation with full self-supervision, which uses self-labeling together with data augmentation and filtering techniques. The proposed system can improve a strong Emformer-Transducer based ASR model pretrained on out-of-domain data, using in-domain audio without any ground-truth transcriptions. Second, to reduce the training cost, we propose a self-restricted RNN Transducer (SR-RNN-T) loss, a new variant of alignment-restricted RNN-T that uses Viterbi forced alignment obtained from self-supervision. To further reduce the compute and network cost, we systematically explore adapting only a subset of weights in the Emformer-Transducer. Our best training recipe achieves a 12.9% relative WER reduction over the strong out-of-domain baseline, which equals 70% of the reduction achievable with full human supervision and centralized training.
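As a reading aid for the headline numbers, the short sketch below spells out what a relative WER reduction is and what the quoted 70% share implies for the fully supervised, centrally trained reference. The function name and the example WER values (10.0 and 8.71) are hypothetical and chosen only for illustration; the abstract reports only the 12.9% and 70% figures, both presumably rounded, so the implied value is approximate.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which WER drops relative to the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs (not from the paper): 10.0 -> 8.71 is a 12.9% relative drop.
example = relative_wer_reduction(10.0, 8.71)

# Figures quoted in the abstract.
self_supervised_fl = 0.129   # relative WER reduction of the best FL recipe
share_of_supervised = 0.70   # its share of the fully supervised reduction

# Implied relative reduction with full human supervision and centralized training.
implied_supervised = self_supervised_fl / share_of_supervised
print(f"{implied_supervised:.1%}")  # prints 18.4%
```

Under these figures, full human supervision with centralized training would correspond to a relative WER reduction of roughly 18.4% over the same out-of-domain baseline.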