Abstract. Enabling a machine to understand a video and describe it in natural language is a difficult task. Real-world videos are much longer than the clips used in research experiments, and each video contains multiple semantic parts. Describing a long video is challenging: it requires controlling the granularity of the video's semantics, excluding redundant information, and producing a complete description. The task is important for video understanding and video retrieval. In this paper, we propose a framework to address these problems. The framework consists of two stages, video segmentation and video description, which are carried out in five steps: (1) extract features from the video sequence with pre-trained deep learning models; (2) fuse the different features of each frame into a single feature vector using a weight vector; (3) generate a histogram of similarity (HOS) from the feature vectors of adjacent frames in the sequence; (4) use a threshold t to divide the video into short fragments with different semantics; and (5) feed the frame-sequence features of each fragment into LSTM networks, which output a natural language description for each fragment. Our approach handles 'in-the-wild' long videos and makes them easier to comprehend, which is meaningful for the task of understanding and describing video.
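The segmentation stage (feature fusion, adjacent-frame similarity, and thresholding) can be illustrated with a minimal sketch. The abstract does not specify the fusion rule, the similarity measure, or the cut rule, so the sketch assumes weighted concatenation, cosine similarity, and a cut wherever the similarity of two adjacent frames falls below t. All function names are hypothetical, and the feature extraction and LSTM description steps are omitted.

```python
import math
from typing import List, Sequence, Tuple


def fuse_features(per_model: Sequence[Sequence[float]],
                  weights: Sequence[float]) -> List[float]:
    """Scale each model's feature vector by its weight and concatenate.

    Weighted concatenation is an assumed fusion rule, not the paper's.
    """
    if len(per_model) != len(weights):
        raise ValueError("need exactly one weight per feature source")
    fused: List[float] = []
    for vec, w in zip(per_model, weights):
        fused.extend(w * x for x in vec)
    return fused


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def adjacent_similarities(frames: Sequence[Sequence[float]]) -> List[float]:
    """Similarity between each frame and the next one (length n - 1)."""
    return [cosine_similarity(frames[i], frames[i + 1])
            for i in range(len(frames) - 1)]


def segment_video(frames: Sequence[Sequence[float]],
                  t: float) -> List[Tuple[int, int]]:
    """Split frames into half-open (start, end) index ranges.

    A new fragment starts after frame i whenever the similarity between
    frame i and frame i + 1 is below the threshold t (assumed cut rule).
    """
    if not frames:
        return []
    fragments: List[Tuple[int, int]] = []
    start = 0
    for i, sim in enumerate(adjacent_similarities(frames)):
        if sim < t:
            fragments.append((start, i + 1))
            start = i + 1
    fragments.append((start, len(frames)))
    return fragments


# Toy input: four frames, each with features from two hypothetical models.
raw_frames = [
    ([1.0, 0.0], [1.0]),
    ([0.9, 0.1], [1.0]),
    ([0.0, 1.0], [0.0]),
    ([0.1, 0.9], [0.0]),
]
weights = [0.5, 0.5]
fused_frames = [fuse_features(f, weights) for f in raw_frames]
print(segment_video(fused_frames, t=0.5))  # [(0, 2), (2, 4)]
```

In this toy input the first two frames and the last two frames are nearly identical, so the only similarity below t = 0.5 lies between frames 1 and 2, and the video splits into two fragments. Each fragment's frame features would then be passed to the description stage.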