Machine learning is one of the key building blocks in 5G and beyond [1,2,3], spanning a broad range of applications and use cases. In the context of mission-critical applications [2,4], machine learning models should be trained with fresh data samples that are generated by and dispersed across edge devices (e.g., phones, cars, access points, etc.). Collecting these raw data samples incurs significant communication overhead and may violate data privacy. In this regard, federated learning (FL) [5,6,7,8] is a promising communication-efficient and privacy-preserving solution that periodically exchanges local model parameters without sharing raw data. However, exchanging model parameters is extremely costly for modern deep neural network (NN) architectures, which often have a huge number of parameters. For instance, MobileBERT is a state-of-the-art NN architecture for on-device natural language processing (NLP) tasks, with 25 million parameters corresponding to 96 MB [9]. Training such a model by exchanging a 96 MB payload in every communication round is challenging, particularly under limited wireless resources.

The aforementioned limitation of FL has motivated the development of federated distillation (FD) [10], which exchanges only the local model outputs, whose dimension is commonly much smaller than the model size (e.g., 10 labels in the MNIST dataset). To illustrate, as shown in Figure 1.1, consider a 2-label classification example wherein each worker in FD runs local iterations with samples having either a blue or a yellow ground-truth label. For each training sample, the worker generates its prediction output distribution, termed a local logit, which is the softmax output vector of the last NN layer activations (e.g., {blue, yellow} = {0.7, 0.3} for a blue sample). At a regular interval, the worker's local logits are averaged per ground-truth label and uploaded to a parameter server, which aggregates and globally averages the local average logits across workers per ground-truth label.
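To make the logit-averaging step concrete, the following is a minimal sketch of the per-label local averaging and the server-side global averaging described above. It assumes NumPy and the illustrative 2-label setting of Figure 1.1; the function names and the per-worker data are hypothetical and not part of any FD reference implementation.

```python
import numpy as np

NUM_LABELS = 2  # e.g., {blue, yellow} in the 2-label example


def local_average_logits(softmax_outputs, labels, num_labels=NUM_LABELS):
    """Average a worker's softmax outputs (local logits) per ground-truth label."""
    sums = np.zeros((num_labels, num_labels))
    counts = np.zeros(num_labels)
    for logit, label in zip(softmax_outputs, labels):
        sums[label] += logit
        counts[label] += 1
    counts = np.maximum(counts, 1)  # avoid division by zero for unseen labels
    return sums / counts[:, None]   # row k = average logit for label k


def global_average_logits(per_worker_averages):
    """Parameter server: average the per-label local averages across workers."""
    return np.mean(np.stack(per_worker_averages), axis=0)


# Hypothetical example: two workers, each holding (softmax output, label) pairs.
worker1 = local_average_logits(
    np.array([[0.7, 0.3], [0.6, 0.4], [0.2, 0.8]]), labels=[0, 0, 1])
worker2 = local_average_logits(
    np.array([[0.9, 0.1], [0.3, 0.7]]), labels=[0, 1])

global_logits = global_average_logits([worker1, worker2])
print(global_logits)  # one globally averaged output distribution per label
```

Note that only these small per-label vectors are exchanged, which is what makes the FD payload independent of the model size.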