Recently, large pre-trained models have significantly improved the performance of various Natural Language Processing (NLP) tasks, but they are expensive to serve due to long serving latency and large memory usage. To compress these models, knowledge distillation has attracted an increasing amount of interest as one of the most effective methods for model compression. However, existing distillation methods have not yet addressed the unique challenges of model serving in datacenters, such as handling fast-evolving models, considering serving performance, and optimizing for multiple objectives. To solve these problems, we propose AutoDistill, an end-to-end model distillation framework integrating model architecture exploration and multi-objective optimization for building hardware-efficient NLP pre-trained models. We use Bayesian Optimization (BO) to conduct multi-objective Neural Architecture Search (NAS) for selecting student model architectures. The proposed search comprehensively considers both prediction accuracy and serving latency on the target hardware. We propose Flash Distillation, a model-agnostic technique that uses a much shorter period of progressive knowledge transfer to distinguish promising student model candidates from less promising ones. Together with the BO algorithm, it significantly reduces the cost of model exploration. Experiments on TPUv4i (Jouppi et al., 2021) identify seven model architectures with better pre-trained accuracy (up to 3.2% higher) and lower inference latency (up to 1.44× faster) than MobileBERT (Sun et al., 2020). On nine downstream NLP tasks in the GLUE benchmark, the model distilled for pre-training by AutoDistill with 28.5M parameters achieves an 81.69 average score, which is higher than BERT BASE (
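To make the search loop concrete, the following is a minimal Python sketch of the kind of multi-objective student-architecture search described above. The search-space fields, the `flash_distillation_proxy` and `measure_latency_ms` stubs, and the scalarized score are illustrative assumptions rather than the paper's actual implementation, and random sampling stands in for the Bayesian Optimization proposal step to keep the example short.

```python
import random

# Hypothetical search space for student architectures; the field names and
# value choices here are illustrative assumptions, not the paper's exact space.
SEARCH_SPACE = {
    "num_layers": [12, 18, 24],
    "hidden_size": [128, 192, 256],
    "num_heads": [2, 4],
    "intermediate_size": [384, 512, 640],
}


def sample_architecture(rng):
    """Draw one candidate student architecture from the search space."""
    return {name: rng.choice(choices) for name, choices in SEARCH_SPACE.items()}


def flash_distillation_proxy(arch, rng):
    """Stand-in for Flash Distillation: a short progressive knowledge-transfer
    run that returns a proxy accuracy instead of fully distilling the student.
    Here it is a synthetic placeholder score."""
    return 0.70 + 0.05 * rng.random() + 0.001 * arch["num_layers"]


def measure_latency_ms(arch):
    """Stand-in for on-device latency measurement (e.g., on TPUv4i).
    A real framework would compile and benchmark the model; this is a crude
    analytical placeholder."""
    return 0.01 * arch["num_layers"] * arch["hidden_size"] / arch["num_heads"]


def multi_objective_score(accuracy, latency_ms, latency_weight=0.02):
    """Scalarize the two objectives: reward accuracy, penalize latency."""
    return accuracy - latency_weight * latency_ms


def search(num_trials=20, seed=0):
    """Search-loop skeleton. AutoDistill proposes candidates with Bayesian
    Optimization; random sampling is used here only for brevity."""
    rng = random.Random(seed)
    best = None
    for _ in range(num_trials):
        arch = sample_architecture(rng)
        accuracy = flash_distillation_proxy(arch, rng)
        latency = measure_latency_ms(arch)
        score = multi_objective_score(accuracy, latency)
        if best is None or score > best[0]:
            best = (score, arch, accuracy, latency)
    return best


if __name__ == "__main__":
    score, arch, accuracy, latency = search()
    print(f"best arch={arch} proxy_acc={accuracy:.3f} latency_ms={latency:.2f}")
```

The key design point the sketch illustrates is that each candidate is evaluated with a cheap proxy (short knowledge transfer plus a latency estimate) before any full distillation, so the outer search can explore many architectures at a fraction of the full training cost.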