In recent years, intelligent fault diagnosis methods based on deep learning have developed rapidly. However, most existing methods assume that training and testing samples are drawn from the same distribution, and their performance drops sharply when the data distribution changes. For rolling bearings, the data distribution shifts whenever the load or speed changes. In this article, to improve fault diagnosis accuracy and anti-noise ability under different working loads, a transfer learning method based on a multi-scale capsule attention network and joint distribution optimal transport (MSCAN-JDOT) is proposed for bearing fault diagnosis. The multi-scale capsule attention network improves feature expression ability and anti-noise performance, so the fault data are better represented. Using the domain adaptation ability of joint distribution optimal transport, the feature distributions of fault data under different loads are aligned, and domain-invariant features are learned. Experiments on bearing fault diagnosis under different loads verify the effectiveness of MSCAN-JDOT, whose fault diagnosis accuracy is higher than that of the compared methods. In addition, fault diagnosis experiments carried out in different noise environments show that MSCAN-JDOT achieves better anti-noise ability than other transfer learning methods.
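To make the alignment step concrete, the following is a minimal sketch of a joint-distribution transport coupling between source-load and target-load samples. It is not the authors' implementation. It assumes a joint cost that adds a squared feature distance to a cross-entropy label loss, and it swaps in an entropic Sinkhorn solver for simplicity. The function names `jdot_cost` and `sinkhorn`, the weights `alpha` and `lam`, and the regularization `reg` are hypothetical, and random vectors stand in for the features a multi-scale capsule attention network would produce.

```python
import numpy as np


def jdot_cost(feat_s, feat_t, y_s_onehot, prob_t, alpha=1.0, lam=1.0):
    """Joint cost for every (source, target) pair: feature distance + label loss.

    alpha and lam are assumed trade-off weights, not values from the article.
    """
    # Squared Euclidean distance between each source and target feature vector.
    diff = feat_s[:, None, :] - feat_t[None, :, :]
    feat_dist = np.sum(diff ** 2, axis=2)
    # Cross-entropy between source labels and the classifier's target predictions.
    label_loss = -y_s_onehot @ np.log(prob_t + 1e-12).T
    return alpha * feat_dist + lam * label_loss


def sinkhorn(cost, reg=0.1, n_iter=200):
    """Entropic optimal transport plan between two uniform empirical distributions."""
    n_s, n_t = cost.shape
    a = np.full(n_s, 1.0 / n_s)  # uniform mass on source samples
    b = np.full(n_t, 1.0 / n_t)  # uniform mass on target samples
    # Rescale the cost to [0, 1] so the exponential does not underflow.
    scaled = cost / max(cost.max(), 1e-12)
    kernel = np.exp(-scaled / reg)
    u = np.ones(n_s)
    for _ in range(n_iter):
        v = b / (kernel.T @ u)  # match target (column) marginals
        u = a / (kernel @ v)    # match source (row) marginals
    return u[:, None] * kernel * v[None, :]


# Toy data standing in for features extracted under two different loads.
rng = np.random.default_rng(0)
feat_s = rng.normal(size=(6, 4))                   # source-load features
feat_t = rng.normal(loc=0.5, size=(8, 4))          # target-load features (shifted)
y_s = np.eye(3)[rng.integers(0, 3, size=6)]        # one-hot source fault labels
prob_t = rng.dirichlet(np.ones(3), size=8)         # predicted target class probabilities

cost = jdot_cost(feat_s, feat_t, y_s, prob_t)
plan = sinkhorn(cost)
alignment_loss = float(np.sum(plan * cost))        # quantity a network would minimize
```

In a training loop of this kind, the plan would typically be recomputed on each mini-batch with the network held fixed, after which `alignment_loss` would be minimized with respect to the network parameters, alternating the two steps.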