Semantic analysis of nasal endoscopic video is challenging because untrimmed surgical video contains a large amount of irrelevant information, such as background, blurred, juddering, or bloodstained fragments. For medical education and research, it is important to automatically identify the start and end points of the valid surgical fragments and to remove the invalid ones. However, the performance of deep-learning-based methods that rely on a fixed time interval and a sliding window is severely degraded when such interference appears at random positions in the video. Specifically, a surgical video is a globally continuous process, but many local discontinuities arise because the endoscope frequently enters and exits the nasal cavity. We therefore propose a multi-granularity semantic analysis framework that meets both the accuracy and the timeliness requirements of endoscopic surgery video analysis. Our approach is an end-to-end solution. First, a joint model extracts the spatio-temporal features of the surgical video at a coarse-grained scale, while an attention mechanism automatically selects the informative spatial features. Second, a hierarchical self-correction module iteratively refines the boundaries of the surgical operation at a fine-grained scale. Finally, we validate the proposed network through extensive experiments and quantitative comparisons with other state-of-the-art approaches. The results indicate that the proposed framework performs well in terms of both accuracy and efficiency.