Temporal sentence grounding aims to localize the specific segment of a video that corresponds to a given query sentence. Previous methods follow a common equally-spaced frame selection mechanism for appearance and motion modeling, which fails to account for redundant and distracting visual information and offers no guarantee that all meaningful frames are retained. Moreover, this task requires precisely detecting location clues in both the spatial and temporal dimensions, yet the relationship between spatial-temporal semantic information and the query sentence remains unexplored in existing methods. Inspired by human thinking patterns, we propose a Coarse-to-Fine Spatial-Temporal Relationship Inference (CFSTRI) network to progressively localize fine-grained activity segments. First, we present a coarse-grained crucial frame selection module, in which query-guided local difference context modeling over adjacent frames helps discriminate all coarse boundary locations relevant to the sentence semantics, and soft assignment vectors of locally aggregated descriptors are employed to enhance the representations of the selected frames. Then, we develop a fine-grained spatial-temporal relationship matching module to refine the coarse boundaries, which disentangles the spatial and temporal semantic information from the query sentence to guide the extraction of visual grounding clues in the corresponding dimensions. Furthermore, we devise a gated graph convolution network to incorporate the spatial-temporal semantic information, leveraging a gate operation to highlight the frames referred to by the query sentence in the spatial and temporal dimensions and propagating the fused information over the graph. Extensive experiments on two benchmark datasets demonstrate that our CFSTRI significantly outperforms most state-of-the-art methods.
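To make the gated graph convolution idea concrete, the sketch below illustrates the general mechanism of a query-conditioned gate modulating frame-node features before message passing over a frame graph. This is a minimal illustration under assumed shapes and layer choices (the class name, dimensions, residual connection, and adjacency normalization are assumptions for exposition), not the authors' implementation.

```python
import torch
import torch.nn as nn

class GatedGraphConvLayer(nn.Module):
    """One gated graph-convolution step over frame nodes (illustrative sketch).

    Each frame node carries a visual feature; a gate computed from the query
    feature suppresses frames the sentence does not refer to before the
    features are propagated over the frame graph.
    """

    def __init__(self, frame_dim: int, query_dim: int):
        super().__init__()
        self.gate = nn.Linear(frame_dim + query_dim, frame_dim)  # per-frame gate
        self.propagate = nn.Linear(frame_dim, frame_dim)         # message transform

    def forward(self, frames: torch.Tensor, query: torch.Tensor,
                adj: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, frame_dim) frame features
        # query:  (B, query_dim) encoded (spatial or temporal) query feature
        # adj:    (B, T, T) row-normalized adjacency of the frame graph
        q = query.unsqueeze(1).expand(-1, frames.size(1), -1)
        g = torch.sigmoid(self.gate(torch.cat([frames, q], dim=-1)))  # highlight referred frames
        gated = g * frames
        # Propagate gated features over the graph: A @ (X W)
        out = torch.bmm(adj, self.propagate(gated))
        return torch.relu(out) + frames  # residual keeps the original visual cues


if __name__ == "__main__":
    layer = GatedGraphConvLayer(frame_dim=512, query_dim=300)
    frames = torch.randn(2, 16, 512)                       # 16 sampled frames per video
    query = torch.randn(2, 300)                            # encoded query sentence
    adj = torch.softmax(torch.randn(2, 16, 16), dim=-1)    # toy row-normalized adjacency
    print(layer(frames, query, adj).shape)                 # torch.Size([2, 16, 512])
```

In this sketch, the sigmoid gate plays the role of highlighting query-relevant frames, while the batched matrix product with the adjacency matrix performs the information propagation over the frame graph.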