Despite the widespread use of indoor positioning technology, current WiFi positioning methods face challenges in accuracy and efficiency due to the complexity of indoor environments and the irregular layouts of pre-deployed Access Points (APs). Visual image positioning methods, while effective, are hindered by the high cost of the binocular cameras required to capture depth image data, which limits their widespread adoption. To address these challenges, we propose an indoor positioning method that leverages multi-granularity and multi-modal feature fusion. In the initial, coarse-grained positioning stage, a neural network model is trained on image data to roughly localize indoor scenes. This model captures the geometric relationships within images, providing a foundation for more precise localization in subsequent stages. In the fine-grained positioning stage, Channel State Information (CSI) features extracted from WiFi signals are fused with Scale-Invariant Feature Transform (SIFT) features extracted from image data to build a rich feature-fusion fingerprint library that enables high-precision positioning. Experimental results show that the proposed method combines the complementary strengths of WiFi fingerprinting and visual positioning, substantially improving positioning accuracy: 45\% of positioning points are localized to within 0.4 meters and 67\% to within 0.8 meters. Overall, this approach charts a promising path toward advancing indoor positioning technology.