Recently, multi-modal fusion methods that combine remote sensing data and social sensing data have been widely used for urban region function recognition. However, because of severe noise in real-world data, most existing methods are not sufficiently robust when applied in real-world scenes, which seriously limits their value for urban planning and management. In addition, how to extract valuable periodic features from social sensing data remains an open problem. To this end, we propose a multi-modal fusion network guided by feature co-occurrence for urban region function recognition, which leverages the co-occurrence relationship between multi-modal features to identify abnormal, noisy features, thereby guiding the fusion network to suppress noisy features and focus on clean ones. Furthermore, we employ a graph convolutional network that incorporates a node weighting layer and an interactive update layer to effectively extract valuable periodic features from social sensing data. Finally, experimental results on publicly available datasets show that the proposed method yields promising improvements in both accuracy and robustness over several state-of-the-art methods.
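
The following is a minimal sketch, not the authors' released implementation, of how co-occurrence-guided fusion could be realized: a consistency score between the remote-sensing and social-sensing region embeddings gates both modalities so that features lacking cross-modal support are suppressed before fusion. The module name, feature dimensions, and the specific gating form are illustrative assumptions.

```python
# Hypothetical sketch of co-occurrence-guided multi-modal fusion (assumed design,
# not the paper's exact architecture).
import torch
import torch.nn as nn
import torch.nn.functional as F


class CooccurrenceGuidedFusion(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        # Project each modality into a shared space where co-occurrence is measured.
        self.proj_rs = nn.Linear(dim, dim)   # remote sensing branch
        self.proj_ss = nn.Linear(dim, dim)   # social sensing branch
        self.fuse = nn.Linear(2 * dim, dim)  # fusion head

    def forward(self, feat_rs: torch.Tensor, feat_ss: torch.Tensor) -> torch.Tensor:
        # Cross-modal co-occurrence score in [0, 1]; a low score flags features
        # that are not supported by the other modality (treated as noise).
        z_rs = F.normalize(self.proj_rs(feat_rs), dim=-1)
        z_ss = F.normalize(self.proj_ss(feat_ss), dim=-1)
        cooc = torch.sigmoid((z_rs * z_ss).sum(dim=-1, keepdim=True))

        # Gate both modalities by the co-occurrence score before fusion,
        # so the network focuses on mutually consistent ("clean") features.
        return self.fuse(torch.cat([cooc * feat_rs, cooc * feat_ss], dim=-1))


if __name__ == "__main__":
    model = CooccurrenceGuidedFusion(dim=256)
    rs = torch.randn(8, 256)    # batch of remote-sensing region embeddings (assumed shape)
    ss = torch.randn(8, 256)    # batch of social-sensing region embeddings (assumed shape)
    print(model(rs, ss).shape)  # torch.Size([8, 256])
```

In this sketch the gate is a scalar per region; a per-channel gate or an attention map would be a straightforward variant, and the paper's actual mechanism may differ.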