Image-based profiling of the cellular response to drug compounds has proven to be an effective method for characterizing the morphological changes that result from chemical perturbation experiments. This approach has been useful across drug discovery, from phenotype-based screening to identifying a compound’s mechanism of action or toxicity. As more data become available, however, there is growing demand for deep learning methods that can be applied to perturbation data. In this paper, we applied the transformer-based SwinV2 computer vision architecture to predict the mechanism of action of 10 kinase inhibitor compounds directly from raw images of the cellular response. This method outperforms the standard approach of using image-based profiles, which are multidimensional feature set representations generated by bioimaging software. Furthermore, we combined the best-performing models for three data modalities (raw images, image-based profiles, and compound chemical structures) to form a fusion model, Cell-Vision Fusion (CVF). This approach classified the kinase inhibitors with 69.79% accuracy and a 70.56% F1 score, 4.20% and 5.49% greater, respectively, than the best-performing image-based profile method. Our work provides three techniques, specific to Cell Painting images, that enable the SwinV2 architecture to train effectively, and explores approaches to combat the significant batch effects present in large Cell Painting perturbation datasets.