“…To avoid distortion, other studies employ fixed-size histograms or other statistics to summarize the distribution of representations. Specifically, they generate video-level descriptors by computing statistics of features [31], [32], [33], [34], or by using a Gaussian Mixture Model (GMM) [35], [36], [37], [38], [39], [40] or Fisher vectors [38], [41]. Although these methods summarize the information without distortion, temporal relations between segments/frames, such as the order of events, are lost once the statistics are computed.…”
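
The loss of temporal order can be illustrated with a minimal sketch. It assumes the simplest possible statistic, per-dimension mean and standard deviation, as a stand-in for the cited methods; it is not a GMM or Fisher vector implementation, and the function name `stats_pool` and the toy frame features are hypothetical.

```python
import math
from statistics import fmean, pstdev


def stats_pool(frames):
    """Summarize a variable-length list of frame features as a fixed-size
    descriptor: per-dimension mean followed by per-dimension std.
    (Hypothetical helper, used only for illustration.)"""
    dims = list(zip(*frames))          # transpose: one tuple per feature dimension
    means = [fmean(d) for d in dims]
    stds = [pstdev(d) for d in dims]   # population standard deviation
    return means + stds


# Toy "video": 4 frames, each with a 2-dimensional feature (made-up values).
frames = [[0.0, 1.0], [1.0, 0.0], [2.0, 3.0], [4.0, 1.0]]
reversed_frames = frames[::-1]         # same events, opposite temporal order

desc_forward = stats_pool(frames)
desc_backward = stats_pool(reversed_frames)

# The descriptor length depends only on the feature dimension (2 * D),
# not on the number of frames.
same_length = len(stats_pool(frames[:2])) == len(desc_forward)

# The two orderings give the same descriptor (up to floating-point tolerance),
# so the order of events cannot be recovered from it.
order_invariant = all(
    math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
    for a, b in zip(desc_forward, desc_backward)
)
```

Under these assumptions, a clip played forward and the same clip played backward map to the same descriptor, which is the limitation the passage describes.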