The rapid development of large pre-trained language models has greatly increased the demand for model compression techniques, among which quantization is a popular solution. In this paper, we propose BinaryBERT, which pushes BERT quantization to the limit by weight binarization. We find that a binary BERT is harder to train directly than a ternary counterpart due to its complex and irregular loss landscape. Therefore, we propose ternary weight splitting, which initializes BinaryBERT by equivalently splitting from a half-sized ternary network. The binary model thus inherits the good performance of the ternary one, and can be further enhanced by fine-tuning the new architecture after splitting. Empirical results show that BinaryBERT has only a slight performance drop compared with the full-precision model while being 24× smaller, achieving state-of-the-art compression results on the GLUE and SQuAD benchmarks.

Figure 2: Loss landscape visualization of the full-precision, ternary and binary models on MRPC. Panels: (a) Full-precision Model; (b) Ternary Model; (c) Binary Model; (d) All Together. For (a), (b) and (c), we perturb the (latent) full-precision weights of the value layer in the 1st and 2nd Transformer layers and compute the corresponding training loss. (d) shows the gap among the three surfaces by stacking them together.

Figure panels: (a) MHA-QK. (b) MHA-V. (c) MHA-O. (d) FFN-Mid. (e) FFN-Out.
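To make the splitting idea concrete, the following is a minimal sketch rather than the paper's construction: the coefficients `a` and `b` below are placeholders, whereas the actual method (as "equivalently splitting" implies) chooses them in closed form so that the two binarized halves also reproduce the ternary values exactly. The sketch only illustrates the latent-level constraint that the two split weights sum back to the ternary model's latent weight, so the doubled-width binary model can be initialized from the half-sized ternary one.

```python
import torch

def split_ternary_weight(w_latent, a=0.5, b=0.05):
    """Illustrative sketch of ternary weight splitting (placeholder
    coefficients, not the paper's closed-form values): split the latent
    full-precision weight of a ternary layer into two latent weights
    w1, w2 with w1 + w2 == w_latent."""
    # Entries the ternary quantizer would keep non-zero (common
    # threshold heuristic, used here only for illustration).
    delta = 0.7 * w_latent.abs().mean()
    nonzero = w_latent.abs() > delta

    # Non-zero ternary entries: split proportionally (a, 1 - a).
    # Zero ternary entries: split into (w + b, -b) so the halves cancel.
    w1 = torch.where(nonzero, a * w_latent, w_latent + b)
    w2 = torch.where(nonzero, (1.0 - a) * w_latent, -b * torch.ones_like(w_latent))

    assert torch.allclose(w1 + w2, w_latent)  # latent sums are preserved
    return w1, w2
```

The loss-landscape comparison in Figure 2 can be reproduced in spirit with a simple 2-D scan; the sketch below assumes a user-supplied `eval_loss()` callable that returns the training loss, and a multiplicative perturbation of the two chosen latent weight tensors, which may differ from the paper's exact perturbation scheme.

```python
import torch

@torch.no_grad()
def loss_surface(eval_loss, w1, w2, span=1.0, steps=21):
    """Perturb two (latent) weight tensors in place over a grid of
    relative offsets and record the loss at each grid point."""
    base1, base2 = w1.detach().clone(), w2.detach().clone()
    offsets = torch.linspace(-span, span, steps)
    surface = torch.zeros(steps, steps)
    for i, x in enumerate(offsets):
        for j, y in enumerate(offsets):
            w1.data.copy_(base1 * (1 + x))
            w2.data.copy_(base2 * (1 + y))
            surface[i, j] = eval_loss()
    # Restore the original weights after the scan.
    w1.data.copy_(base1)
    w2.data.copy_(base2)
    return offsets, surface
```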