Deep learning has proven effective in diagnosing faults in power machinery, but its diagnostic performance relies on a sufficiently large data set. In practice, a well-labeled data set with sufficient samples is rare, especially for machinery operating under varying load conditions. The situation is particularly pronounced for multi-cylinder internal combustion engines, where the excitations from individual cylinders interact with significant background noise and the data distributions differ markedly across operating conditions. To tackle these issues, we propose a novel multi-modal joint attention network (MJA-Net) that fuses vibration and acoustic signals for diagnosing multiple faults. In MJA-Net, feature maps from the two modalities are fed separately into convolutional modules to learn modality-specific features, and a joint attention module (JAM) is employed to enhance vibro-acoustic information interaction and distribution consistency across modalities. Analysis of vibro-acoustic experimental data collected under multiple loads shows that MJA-Net achieves superior classification performance on limited-sample tasks compared with single-modal methods. Furthermore, MJA-Net outperforms other fusion methods in average accuracy (97.65%), feature representativeness, and vibro-acoustic feature consistency across loads, and the JAM yields better diagnostic performance than alternative modules. The class activation maps generated by Layer CAM highlight the key impact components related to the engine's working mechanisms, providing valuable insight into MJA-Net's decisions in multi-fault recognition.
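To make the described architecture concrete, the following is a minimal PyTorch-style sketch of the two-branch convolution plus joint-attention fusion pattern summarized above. It is an illustration only, not the authors' implementation: the layer sizes, sequence lengths, the number of fault classes (n_classes), and the use of standard multi-head cross-attention as a stand-in for the JAM are all assumptions made for this example.

# Illustrative sketch (not the authors' code): a two-branch 1-D CNN whose
# intermediate feature maps are fused by a joint cross-attention block,
# mirroring the "independent convolutional modules + JAM" structure
# described above. All layer sizes and the attention formulation are assumed.
import torch
import torch.nn as nn


class ConvBranch(nn.Module):
    """Modality-specific feature extractor (one per signal type)."""
    def __init__(self, in_ch=1, feat_ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_ch, 32, kernel_size=16, stride=2, padding=7),
            nn.BatchNorm1d(32), nn.ReLU(),
            nn.Conv1d(32, feat_ch, kernel_size=8, stride=2, padding=3),
            nn.BatchNorm1d(feat_ch), nn.ReLU(),
        )

    def forward(self, x):          # x: (B, 1, L)
        return self.net(x)         # (B, feat_ch, L')


class JointAttention(nn.Module):
    """Cross-modal attention: each modality attends to the other so the fused
    features interact and align in distribution (hypothetical JAM stand-in)."""
    def __init__(self, feat_ch=64, n_heads=4):
        super().__init__()
        self.attn_v = nn.MultiheadAttention(feat_ch, n_heads, batch_first=True)
        self.attn_a = nn.MultiheadAttention(feat_ch, n_heads, batch_first=True)

    def forward(self, fv, fa):                            # (B, C, L') feature maps
        fv, fa = fv.transpose(1, 2), fa.transpose(1, 2)   # -> (B, L', C)
        v2a, _ = self.attn_v(fv, fa, fa)                  # vibration queries acoustic
        a2v, _ = self.attn_a(fa, fv, fv)                  # acoustic queries vibration
        return torch.cat([fv + v2a, fa + a2v], dim=-1)    # (B, L', 2C)


class MJANetSketch(nn.Module):
    def __init__(self, n_classes=8, feat_ch=64):          # n_classes is hypothetical
        super().__init__()
        self.vib_branch = ConvBranch(feat_ch=feat_ch)
        self.aco_branch = ConvBranch(feat_ch=feat_ch)
        self.jam = JointAttention(feat_ch)
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(2 * feat_ch, n_classes),
        )

    def forward(self, vib, aco):                          # both (B, 1, L)
        fused = self.jam(self.vib_branch(vib), self.aco_branch(aco))
        return self.head(fused.transpose(1, 2))           # logits (B, n_classes)

In this sketch a forward pass takes a batch of time-aligned vibration and acoustic segments of equal length; the fused representation is pooled and classified, and the attention weights inside JointAttention are where a Layer CAM-style analysis could be attached for interpretation.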