The use of social media platforms has grown rapidly with their ease of use and fast accessibility, and the lack of posting constraints and content authentication has made them fertile ground for rumor proliferation. There is therefore a need to leverage artificial intelligence techniques to detect rumors on social media and prevent their adverse effects on society and individuals. Most existing work on Arabic rumor detection targets only the textual features of tweet content. Nevertheless, tweets carry several types of content, such as text, images, videos, and URLs, and the visual features of tweets play an essential role in rumor diffusion. This study proposes a model for detecting Arabic rumors on Twitter using both textual and visual (image) features through two types of multimodal fusion: early and late fusion. In addition, we leverage transfer learning from pre-trained language and vision models. Several experiments were conducted to select the best textual and visual feature extractors for the multimodal model: MARBERTv2 was chosen as the textual feature extractor, and an ensemble of VGG-19 and ResNet50 as the visual feature extractor. The single-modality language and vision models were then used as baselines against which the multimodal models were compared. Finally, the experimental results demonstrate that textual features alone are more effective for the rumor detection task than the multimodal models.
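To make the fusion architecture concrete, the sketch below shows one plausible early-fusion design along the lines described above: MARBERTv2 [CLS] embeddings are concatenated with VGG-19 and ResNet50 image features and fed to a small classifier head. The Hugging Face model ID `UBC-NLP/MARBERTv2`, the projection sizes, and the classifier head are illustrative assumptions, not the paper's exact configuration; a late-fusion variant would instead train a separate classifier per modality and combine their output scores.

```python
# A minimal early-fusion sketch, assuming PyTorch, transformers, and torchvision.
# Layer sizes and the classifier head are illustrative choices, not the paper's setup.
import torch
import torch.nn as nn
from transformers import AutoModel
from torchvision import models


class EarlyFusionRumorClassifier(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        # Pre-trained Arabic textual encoder (BERT-style, 768-d hidden states).
        self.text_encoder = AutoModel.from_pretrained("UBC-NLP/MARBERTv2")

        # Pre-trained visual encoders used as feature extractors (classifier heads removed).
        vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT)
        self.vgg_features = nn.Sequential(vgg.features, vgg.avgpool, nn.Flatten())  # 512*7*7 = 25088-d
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.resnet_features = nn.Sequential(*list(resnet.children())[:-1], nn.Flatten())  # 2048-d

        # Project each modality to a common size before concatenation (assumed choice).
        self.text_proj = nn.Linear(768, 256)
        self.vgg_proj = nn.Linear(25088, 256)
        self.resnet_proj = nn.Linear(2048, 256)

        # Classifier over the fused (concatenated) representation.
        self.classifier = nn.Sequential(
            nn.Linear(256 * 3, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, num_classes),
        )

    def forward(self, input_ids, attention_mask, image):
        # [CLS]-token embedding as the tweet-level textual feature.
        text_feat = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state[:, 0]
        # Early fusion: concatenate textual and both visual feature vectors.
        fused = torch.cat(
            [
                self.text_proj(text_feat),
                self.vgg_proj(self.vgg_features(image)),
                self.resnet_proj(self.resnet_features(image)),
            ],
            dim=-1,
        )
        return self.classifier(fused)
```

In this sketch the two CNNs are "ensembled" by concatenating their features before the shared head; in a late-fusion setup, each modality's classifier would be trained independently and their predicted probabilities averaged or otherwise combined at decision time.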