The direction of arrival (DOA) and the number of sound sources are usually estimated using the short-time Fourier transform and the conjugate cross-spectrum. However, the ability of a single acoustic vector sensor (AVS) to distinguish between multiple sources decreases as the number of sources increases. To solve this problem, this paper presents a multimodal fusion method based on a single AVS. First, the output of the AVS is decomposed into multiple modes by intrinsic time-scale decomposition (ITD). After decomposition, each mode contains fewer sources than the original output. Then, the DOAs and the source number in each mode are estimated by density peak clustering (DPC). Finally, the density-based spatial clustering of applications with noise (DBSCAN) algorithm is applied to the DOAs of all modes to obtain the final source counting result. Experiments showed that the multimodal fusion method significantly improved the ability of a single AVS to distinguish multiple sources compared with methods without multimodal fusion.
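The final fusion step described above can be illustrated with a minimal sketch. It assumes that the ITD and DPC stages (not shown) have already produced a list of azimuth estimates in degrees for each mode. The function names, the neighborhood radius `eps_deg`, and the threshold `min_samples` are illustrative assumptions and are not values taken from this paper. The sketch pools the per-mode DOAs, runs a plain DBSCAN with a circular angular distance, and reports the number of clusters as the source count.

```python
import math

def angular_distance(a, b):
    """Smallest absolute difference between two azimuths, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def circular_mean(angles):
    """Mean direction of a set of azimuths, in degrees within [0, 360)."""
    s = sum(math.sin(math.radians(a)) for a in angles)
    c = sum(math.cos(math.radians(a)) for a in angles)
    return math.degrees(math.atan2(s, c)) % 360.0

def dbscan_angles(angles, eps_deg, min_samples):
    """Plain DBSCAN on azimuths; returns one label per angle, -1 for noise."""
    n = len(angles)
    # Each neighborhood includes the point itself (distance 0).
    neighbors = [
        [j for j in range(n) if angular_distance(angles[i], angles[j]) <= eps_deg]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # expand only through core points
    return labels

def fuse_modes(doas_per_mode, eps_deg=10.0, min_samples=2):
    """Pool the DOAs of all modes and count the sources as DBSCAN clusters."""
    pooled = [a for mode in doas_per_mode for a in mode]
    labels = dbscan_angles(pooled, eps_deg, min_samples)
    count = max(labels, default=-1) + 1
    centers = [
        circular_mean([a for a, lab in zip(pooled, labels) if lab == c])
        for c in range(count)
    ]
    return count, centers

# Hypothetical per-mode DOA estimates (degrees); 200.0 is an isolated outlier.
modes = [[30.0, 121.0], [29.0, 200.0], [119.0, 31.0], [122.0, 358.0, 2.0]]
count, centers = fuse_modes(modes)
print(count, [round(c, 1) for c in centers])
```

In this sketch, an estimate that appears in only one mode and has no neighbor within `eps_deg` is labeled as noise and does not contribute to the source count. The circular distance keeps estimates near 0° and 360° in the same cluster.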