With the emergence of graph convolutional networks (GCNs), skeleton-based action recognition has achieved remarkable results. However, current models for skeleton-based action analysis treat a skeleton sequence as a series of graphs and aggregate features over the entire sequence by alternately extracting spatial and temporal features, i.e., a factorized 2D (spatial) plus 1D (temporal) feature extraction scheme. This overlooks the complex spatiotemporal interactions between joints during motion, making it difficult for models to capture dependencies across frames and joints. In this paper, we propose a Multimodal Graph Self-Attention Network (MGSAN), which combines GCNs with self-attention to model the spatiotemporal relationships within skeleton sequences. First, we design graph self-attention (GSA) blocks to capture the intrinsic topology and long-term temporal dependencies between joints. Second, we propose a multi-scale spatio-temporal convolutional network with channel-wise topology modeling (CW-TCN) to model the short-term, smooth temporal dynamics of joint movements. Finally, we propose a multimodal fusion strategy that combines the joint, joint-motion, and bone streams, providing the model with richer multimodal features for better predictions. MGSAN achieves state-of-the-art performance on three large-scale skeleton-based action recognition datasets, with accuracies of 93.1% on the NTU RGB+D 60 cross-subject benchmark, 90.3% on the NTU RGB+D 120 cross-subject benchmark, and 97.0% on the NW-UCLA dataset. Code is available at https://github.com/lizaowo/MGSAN.
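The core idea of combining GCNs with self-attention can be illustrated with a minimal sketch: data-driven joint-to-joint attention is blended with the skeleton's physical adjacency matrix so that the learned topology is anchored by the body structure. The function name, the blending weight `alpha`, and the projection matrices below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def graph_self_attention(X, A, Wq, Wk, Wv, alpha=0.5):
    """Hypothetical GSA step for one frame.
    X: (V, C) joint features, A: (V, V) row-normalized skeleton adjacency.
    Blends learned attention with the fixed graph topology."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)  # (V, V) joint-to-joint weights
    topo = alpha * attn + (1 - alpha) * A                    # inject the skeleton prior
    return topo @ V                                          # aggregate joint features

rng = np.random.default_rng(0)
V_joints, C = 25, 16                      # 25 joints, as in NTU RGB+D
X = rng.standard_normal((V_joints, C))    # per-joint features of one frame
A = np.eye(V_joints)                      # placeholder adjacency (self-loops only)
Wq, Wk, Wv = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))
out = graph_self_attention(X, A, Wq, Wk, Wv)
print(out.shape)  # (25, 16)
```

Since both the attention rows and the normalized adjacency rows sum to one, the blended topology remains a convex combination of joint features for any `alpha` in [0, 1].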