Objectives: Multimodal deep learning, which integrates images, text, video, speech, and acoustic signals, has grown rapidly. This article explores the untapped potential of multimodal deep learning for Visual Question Answering (VQA) and addresses a research gap in the development of effective techniques for comprehensive image feature extraction. Methods: The article provides a comprehensive overview of VQA and its associated challenges. It emphasizes the need for rich image representations in VQA, pinpoints the specific research gap in image feature extraction, and surveys the fundamental concepts of VQA, the challenges faced, and the main approaches and applications for VQA tasks. A substantial portion of this review investigates recent advances in image feature extraction techniques. Findings: Most existing VQA research emphasizes accurately matching answers to given questions while overlooking the need for a comprehensive representation of images. These models rely primarily on analysis of question content, underemphasizing image understanding or sometimes neglecting image examination entirely. Multimodal systems also tend to neglect or overemphasize one modality, notably the visual one, which undermines genuine multimodal integration. The review further reveals that benchmarking of image feature extraction techniques is limited, even though evaluating the quality of extracted image features is crucial for VQA tasks. Novelty: Whereas many VQA studies concentrate primarily on the accuracy of answers to questions, this review emphasizes the importance of comprehensive image representation.
The paper explores recent advances in Capsule Networks (CapsNets) and Vision Transformers (ViTs) as alternatives to traditional Convolutional Neural Networks (CNNs) for the development of more effective image feature extraction techniques, which can help address the limitations of existing VQA models that focus primarily on question content analysis.
https://www.indjst.org/
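To make the contrast with CNNs concrete, the sketch below illustrates the patch-based feature extraction that characterizes ViTs. It is a minimal NumPy illustration, not taken from the paper: the patch size, embedding dimension, and the fixed random projection (standing in for a learned one) are all assumptions chosen for clarity.

```python
import numpy as np

def vit_patch_features(image, patch_size=16, embed_dim=64, seed=0):
    """Split an image into non-overlapping patches and linearly project
    each flattened patch into an embedding vector, as in the first stage
    of a Vision Transformer. The projection matrix is learned in a real
    ViT; here it is a fixed random matrix for illustration only."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    rng = np.random.default_rng(seed)
    projection = rng.standard_normal((patch_size * patch_size * c, embed_dim))
    patches = []
    for i in range(0, h, patch_size):
        for j in range(0, w, patch_size):
            # Flatten each patch_size x patch_size x c patch into one vector.
            patches.append(image[i:i + patch_size, j:j + patch_size, :].reshape(-1))
    patches = np.stack(patches)   # shape: (num_patches, patch_size^2 * c)
    return patches @ projection   # shape: (num_patches, embed_dim)

# A 224x224 RGB image yields (224/16)^2 = 196 patch embeddings.
image = np.zeros((224, 224, 3))
features = vit_patch_features(image)
print(features.shape)  # (196, 64)
```

Unlike a CNN, whose convolutions build features from local receptive fields layer by layer, a ViT feeds these patch embeddings (plus positional encodings) into self-attention layers, letting every patch attend to every other patch from the first layer onward.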