Most videos consist of multiple clips; that is, a video usually carries multiple pieces of spatio-temporal information. However, existing video datasets consist of single video clips. Additionally, content-based video retrieval methods have been implemented based on a single query clip. A single query, however, may not be sufficient to express the user's intention, and it narrows the capability to search for multiple semantics. How to retrieve videos composed of multiple clips has not been adequately investigated. This study proposes a novel multilabel video retrieval method that utilizes multiple video clips as queries. Regardless of the number of queries, it transforms multi-query video retrieval into single-query video retrieval. The method is independent of both the video feature and distance metric types. Nevertheless, we propose a deep video hashing method to achieve speed and efficiency. Experiments performed on three prepared multilabel video datasets demonstrate the effectiveness of the proposed method. The source code of the study is publicly available.
KEYWORDS
content-based video retrieval, multi-query optimal point, multiple video queries, objective space, video hashing
INTRODUCTION
Content-based video retrieval (CBVR) is the process of retrieving videos from a video database by querying them based on visual and temporal content. The visual content is usually an object, face, color, texture, or scene at the frame level, whereas the temporal content is an action type at the video level. Video frames, each of which is an image, usually have a multilabel character. At the video level, the temporal relation between frames also expresses a piece of semantic information, 1,2 which is called the action type. For example, an activity on a football field other than playing football tends toward the football semantic at the frame level; at the video level, however, the action type is not football.

Neglecting temporal dependencies is sometimes a necessity but sometimes a deficiency. 3-5 For face or object retrieval, for example, temporal modeling is not required. 6-9 However, some CBVR studies fundamentally ignore temporal properties. Janwe et al. used video segmentation to obtain video shots. 10 From each shot, keyframes that adequately represent its semantic meaning are extracted. The keyframe features are then extracted and used to train a multi-stream CNN classifier. Finally, the classification scores of the shots are fused to obtain the final class score. Similar studies are based on shot boundary detection and keyframe extraction techniques. 11,12 In such studies, video semantics are considered only at the frame level. There have been various studies considering spatio-temporal characteristics. 13-17 On the other hand, a few studies consider only actions but ignore pixel-level features. 14,18

The most concise video scene that carries at least one semantic meaning at the video level is called a clip. Naturally, videos made up of such clips are multilabel. The use of a single video clip qu...