With the prevalence of video-based social media and the growth of user-generated content, the Internet is filled with large amounts of unstructured data. Videos typically carry multimodal data such as titles, tags, images, and audio, so fusing multimodal features is an effective approach to video topic detection. However, video titles and tags are short and sparse and convey high-level semantics, whereas audio and images convey low-level semantics, so directly fusing these features is not a suitable way to represent a video. To address this issue, an effective multimodal fusion method based on the transformer model is proposed for video topic detection. First, video data are crawled from the Bilibili platform, and the titles, tags, and descriptions are cleaned by removing invalid symbols and null values; the audio tracks are converted to text, and text is recognized from the video covers. Second, a transformer-based model fuses these three forms of text from different modalities to represent each video as a multi-dimensional vector. HDBSCAN and hierarchical clustering (HC) are then compared by Silhouette coefficient when clustering the videos for topic detection. In addition, video topic clustering is compared under multimodal and single-modal settings. Finally, the intensity and content evolution of video topics over time are analyzed. Experimental results on real data collected from Bilibili verify the effectiveness of the proposed method for video topic detection and evolution.
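
The sketch below illustrates the embedding-and-clustering comparison described above, assuming the three cleaned texts per video (title/tags/description, speech-to-text output, and cover OCR output) are already available. The encoder name, the fusion-by-concatenation scheme, and all parameters are illustrative assumptions, not the paper's exact configuration; HDBSCAN here comes from scikit-learn (version 1.3 or later).

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering, HDBSCAN  # HDBSCAN requires scikit-learn >= 1.3
from sklearn.metrics import silhouette_score

def embed_videos(meta_texts, asr_texts, ocr_texts,
                 model_name="paraphrase-multilingual-MiniLM-L12-v2"):
    """Encode each modality's text with a transformer encoder and concatenate the
    vectors per video (a simple stand-in for the paper's transformer-based fusion)."""
    encoder = SentenceTransformer(model_name)
    return np.hstack([
        encoder.encode(meta_texts),   # cleaned title + tags + description
        encoder.encode(asr_texts),    # audio converted to text
        encoder.encode(ocr_texts),    # text recognized from video covers
    ])

def noise_free_silhouette(X, labels):
    """Silhouette coefficient over non-noise points (HDBSCAN labels noise as -1)."""
    labels = np.asarray(labels)
    mask = labels != -1
    if len(set(labels[mask])) < 2:
        return float("nan")
    return silhouette_score(X[mask], labels[mask])

def compare_clusterings(video_vectors, n_hc_clusters=10, min_cluster_size=5):
    """Cluster the fused video vectors with HDBSCAN and hierarchical clustering (HC)
    and return the Silhouette coefficient of each result for comparison."""
    hdb_labels = HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(video_vectors)
    hc_labels = AgglomerativeClustering(n_clusters=n_hc_clusters).fit_predict(video_vectors)
    return {
        "HDBSCAN": noise_free_silhouette(video_vectors, hdb_labels),
        "HC": noise_free_silhouette(video_vectors, hc_labels),
    }

# Usage (with lists of per-video texts crawled and preprocessed beforehand):
#   vectors = embed_videos(meta_texts, asr_texts, ocr_texts)
#   print(compare_clusterings(vectors))

A single-modal baseline can be obtained the same way by embedding only one of the three text lists, which is how the multimodal and single-modal clustering results could be contrasted.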