In recent decades, with the ever-growing scale of video data, near-duplicate videos have continued to emerge. The data quality issues caused by near-duplicate videos are becoming increasingly prominent and hinder the normal use of video data. Although existing studies on near-duplicate video detection can help uncover data quality issues in video datasets, they still lack an automatic merging process for video data represented by high-dimensional features, which makes it difficult to automatically clean near-duplicate videos and thereby improve the data quality of video datasets. At present, there are few studies on near-duplicate video data cleaning, and the existing ones are sensitive to the ordering of video data and to the choice of initial clustering centers when the prior distribution is unknown, which severely affects the accuracy of near-duplicate video data cleaning. To address these issues, this paper proposes an automatic near-duplicate video data cleaning method based on a consistent feature hash ring. First, a residual network with convolutional block attention modules, a long short-term memory network, and an attention model are integrated to construct an RCLA deep network with a multi-head attention mechanism, which extracts spatiotemporal features of video data. Then, a consistent feature hash ring is constructed, which effectively alleviates the sensitivity to the ordering of video data while providing a basis for merging near-duplicate videos. To reduce the sensitivity of the cleaning results to the initial cluster centers, an optimized feature-distance-means clustering algorithm is constructed by applying a mountain peak function on the consistent feature hash ring, enabling automatic cleaning of near-duplicate video data. Finally, experiments are conducted on the widely used CC_WEB_VIDEO dataset and a coal mining video dataset. Compared with several existing methods, the experimental results demonstrate the effectiveness of the proposed approach.
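Since the abstract only outlines the pipeline, the following is a minimal illustrative sketch, not the authors' implementation, of the general idea it describes: high-dimensional video features are mapped onto a hash ring, initial cluster centers are chosen with a mountain (peak) function so the result does not depend on an arbitrary initialization, and near-duplicate groups are then formed by distance-based clustering. All function names (e.g., `ring_position`, `mountain_score`) and parameter values here are assumptions made for illustration.

```python
import hashlib
import numpy as np

def ring_position(feature, n_buckets=2**16):
    """Map a high-dimensional feature vector to a slot on a hash ring.

    Illustrative stand-in for a consistent feature hash ring: a deterministic
    hash of a coarse feature signature decides the ring slot, so similar
    features with the same signature land in the same slot.
    """
    signature = np.sign(feature).astype(np.int8).tobytes()
    digest = hashlib.blake2b(signature, digest_size=8).digest()
    return int.from_bytes(digest, "big") % n_buckets

def mountain_score(features, candidate, sigma=1.0):
    """Mountain (peak) function: density of features around a candidate center."""
    d2 = np.sum((features - candidate) ** 2, axis=1)
    return np.sum(np.exp(-d2 / (2 * sigma ** 2)))

def pick_initial_centers(features, k, sigma=1.0, beta=1.5):
    """Select k initial centers by repeatedly taking the mountain-function peak
    and suppressing the density contribution around each selected center."""
    scores = np.array([mountain_score(features, f, sigma) for f in features])
    centers = []
    for _ in range(k):
        idx = int(np.argmax(scores))
        centers.append(features[idx])
        d2 = np.sum((features - features[idx]) ** 2, axis=1)
        scores = scores - scores[idx] * np.exp(-d2 / (2 * (beta * sigma) ** 2))
    return np.stack(centers)

def distance_means_clustering(features, centers, n_iter=20):
    """Simple distance-means loop: assign each feature to its nearest center,
    then recompute each center as the mean of its assigned features."""
    for _ in range(n_iter):
        d = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(d, axis=1)
        for j in range(len(centers)):
            members = features[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return labels, centers

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy "video features": two groups of near-duplicates in 8 dimensions.
    feats = np.vstack([rng.normal(0, 0.05, (10, 8)), rng.normal(1, 0.05, (12, 8))])
    print("ring slot of first video:", ring_position(feats[0]))
    centers = pick_initial_centers(feats, k=2, sigma=0.5)
    labels, centers = distance_means_clustering(feats, centers)
    # Videos sharing a cluster label would be treated as near-duplicates and merged.
    print("cluster labels:", labels)
```

In this sketch the mountain function replaces random center initialization, which is the role the abstract attributes to it; the paper's actual clustering operates on the consistent feature hash ring and on deep spatiotemporal features rather than on the toy vectors used here.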