In recent years, the rise of location-based service applications such as cashier-less shopping, mobile advertisement targeting, and geo-based augmented reality (AR) has been remarkable. These applications offer convenient and interactive experiences by utilizing indoor localization technology. One popular research area in indoor localization is passive fingerprinting localization based on Channel State Information (CSI), which uses general-purpose Wi-Fi platforms and "unconscious cooperative sensing" to achieve device-free localization. However, existing studies face challenges related to inadequate fingerprint richness, limited distinguishability, and inconsistent fingerprint features in real-world dynamic environments. To address these challenges, we propose MFFLoc in this paper. MFFLoc extracts and processes amplitude and phase information from CSI in a 2D manner. It then fuses the amplitude and phase information using multimodal fusion representation, resulting in rich and distinguishable fused fingerprint features. This approach allows MFFLoc to achieve satisfactory accuracy with just one communication link, reducing deployment costs. To overcome the issue of inconsistent fingerprint features in dynamic environments, MFFLoc introduces an unsupervised domain adaptation method. It employs a dual-flow structure, with one flow operating in the source domain and the other in the target domain. The weights of the adaptation layer are correlated between the two flows but not shared. Meta-learning is also used to automatically determine the most suitable adaptation layer. In extensive 6-day experiments conducted in a dynamic indoor environment, MFFLoc outperforms state-of-the-art systems, achieving higher localization accuracy and robustness, which makes it a promising solution for indoor localization applications.
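
As an illustration of what extracting amplitude and phase from CSI "in a 2D manner" could look like, the following minimal sketch splits a complex CSI matrix (packets by subcarriers) into two real-valued 2D maps. It is not the MFFLoc pipeline: the matrix layout, the function names `unwrap` and `csi_to_2d_features`, the phase unwrapping along the subcarrier axis, and the synthetic input are all assumptions made for this example.

```python
import cmath
import math


def unwrap(phases):
    """Remove 2*pi jumps between consecutive phase samples (hypothetical helper)."""
    out = [phases[0]]
    for p in phases[1:]:
        delta = p - out[-1]
        # Wrap the step into [-pi, pi) so neighbouring samples stay continuous.
        delta = (delta + math.pi) % (2 * math.pi) - math.pi
        out.append(out[-1] + delta)
    return out


def csi_to_2d_features(csi):
    """Split a packets x subcarriers complex CSI matrix into two real 2D maps.

    Assumption: csi[i][k] is the complex channel response of packet i
    on subcarrier k for a single communication link.
    """
    amplitude = [[abs(h) for h in packet] for packet in csi]
    phase = [unwrap([cmath.phase(h) for h in packet]) for packet in csi]
    return amplitude, phase


# Synthetic input (not measured data): 2 packets, 4 subcarriers,
# amplitude growing by 0.1 and phase advancing 2.5 rad per subcarrier.
csi = [[cmath.rect(1.0 + 0.1 * k, 2.5 * k) for k in range(4)] for _ in range(2)]
amplitude, phase = csi_to_2d_features(csi)
```

The two resulting maps have the same shape, so they could be stacked as separate input channels for a fusion model; how MFFLoc actually fuses them is not specified here.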