Understanding agents’ motion behaviors in complex scenes is crucial for intelligent autonomous systems such as delivery robots and self-driving cars. The task is challenging due to the inherent uncertainty of future trajectories and the large variation in scene layouts. Yet most recent approaches ignore or underutilize scene information. In this work, we propose a Multi-Granularity Scenarios Understanding framework, MGSU, to explore the scene layout at different granularities. MGSU consists of three modules: (1) a coarse-grained fusion module that uses cross-attention to fuse the observed trajectory with the semantic information of the scene; (2) an inverse reinforcement learning module that generates an optimal path policy through grid-based policy sampling and outputs multiple scene paths; (3) a fine-grained fusion module that integrates the observed trajectory with the scene paths to generate multiple future trajectories. To fully exploit the scene information and improve efficiency, we present a novel scene-fusion Transformer, whose encoder extracts scene features and whose decoder fuses scene and trajectory features to generate future trajectories. Compared with the current state-of-the-art methods, our method reduces ADE by 4.3% on SDD and 3.3% on NuScenes by gradually integrating scene information at different granularities. Visualized trajectories demonstrate that our method predicts future trajectories accurately after fusing scene information.
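To make the coarse-grained fusion step concrete, the following is a minimal PyTorch sketch of how an observed trajectory might be fused with scene semantics via cross-attention, with trajectory tokens as queries and scene tokens as keys/values. It is an illustrative sketch under assumed shapes, not the authors' implementation; all names (CoarseGrainedFusion, d_model, etc.) and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class CoarseGrainedFusion(nn.Module):
    """Hypothetical sketch: fuse embedded trajectory tokens (queries)
    with embedded scene semantic tokens (keys/values) via cross-attention."""

    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, traj_feats: torch.Tensor, scene_feats: torch.Tensor) -> torch.Tensor:
        # traj_feats:  (B, T_obs, d_model)   embedded observed trajectory
        # scene_feats: (B, N_cells, d_model) embedded scene semantic map (flattened grid)
        fused, _ = self.cross_attn(query=traj_feats, key=scene_feats, value=scene_feats)
        x = self.norm1(traj_feats + fused)   # residual connection + layer norm
        return self.norm2(x + self.ffn(x))   # position-wise feed-forward refinement

# Toy usage: batch of 2 agents, 8 observed steps, 16x16 semantic grid flattened to 256 tokens
traj = torch.randn(2, 8, 64)
scene = torch.randn(2, 256, 64)
out = CoarseGrainedFusion()(traj, scene)
print(out.shape)  # torch.Size([2, 8, 64]) -- scene-aware trajectory features
```

In this sketch the scene acts only as keys/values, so the output keeps the trajectory's sequence length while each trajectory token attends to the scene regions most relevant to it; a decoder of this form matches the abstract's description of fusing scene and trajectory features to generate future trajectories.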