Surface defect detection aims to classify and locate a certain defect that exists in the target surface area. It is an important part of industrial quality inspection. Most of the research on surface defect detection are currently based on convolutional neural networks (CNNs), which are more concerned with local information and lack global perception. Thus, CNNs are unable to effectively extract the defect features. In this paper, a defect detection method based on the Swin transformer is proposed. The structure of the Swin transformer has been fine-tuned so that it has five scales of output, making it more suitable for defect detection tasks with large variations in target size. A bi-directional feature pyramid network is used as the feature fusion part to efficiently fuse to the extracted features. The focal loss is used as a loss function to weight the hard- and easy-to-distinguish samples, potentially making the model fit the surface defect data better. To reduce the number of parameters in the model, a shared detection head was chosen for result prediction. Experiments were conducted on the flange surface defect dataset and the steel surface defect dataset, respectively. Compared with the classical CNNs target detection algorithm, our method improves the mean average precision (mAP) by about 15.4%, while the model volume and detection speed are essentially the same as those of the CNNs-based method. The experimental results show that our proposed method is more competitive compared with CNNs-based methods and has some generality for different types of defects.
With the advent and rapid development of image tampering technology, it has become harmful to many aspects of our society. Thus, image tampering detection has been increasingly important. Although current forgery detection methods have achieved some success, the scale of the tampered areas in each forgery image are different, and previous methods do not take this into account. In this paper, we believe that the inability of the network to accommodate tampered regions of various sizes is the main reason for the low precision. To address the mentioned problem, we propose a neural network architecture called CAU-Net, which adds residual propagation and feedback, attention gate and Atrous Spatial Pyramid Pooling with CBAM to the U-Net. The Atrous Spatial Pyramid Pooling with CBAM can capture information from multiple scales and adapt to differently sized target areas. In addition, CAU-Net can solve the vanishing gradient issue and suppress the weight of untampered regions, and CAU-Net is an end-to-end network without redundant image processing; thus, it is fast to detect suspicious images. In the end, we optimize the proposed network structure by ablation study, and the experimental results and visualization results demonstrate that our network has a better performance on CASIA and NIST16 compared with state of the art methods.
Transformer models are now widely used for speech processing tasks due to their powerful sequence modeling capabilities. Previous work determined an efficient way to model speaker embeddings using the Transformer model by combining transformers with convolutional networks. However, traditional global self-attention mechanisms lack the ability to capture local information. To alleviate these problems, we proposed a novel global–local self-attention mechanism. Instead of using local or global multi-head attention alone, this method performs local and global attention in parallel in two parallel groups to enhance local modeling and reduce computational cost. To better handle local location information, we introduced locally enhanced location encoding in the speaker verification task. The experimental results of the VoxCeleb1 test set and the VoxCeleb2 dev set demonstrated the improved effect of our proposed global–local self-attention mechanism. Compared with the Transformer-based Robust Embedding Extractor Baseline System, the proposed speaker Transformer network exhibited better performance in the speaker verification task.
In the Internet of Things (IoT) era, various devices generate massive videos containing rich human relations. However, the long-distance transmission of huge videos may cause congestion and delays, and the large gap between the visual and relation spaces brings about difficulties for relation analysis. Hence, this study explores an edge-cloud intelligence framework and two algorithms for cooperative relation extraction and analysis from videos based on an IoT system. First, we exploit a cooperative mechanism on the edges and cloud, which can schedule the relation recognition and analysis subtasks from massive video streams. Second, we propose a Multi-Granularity relation recognition Model (MGM) based on coarse and fined granularity features. This means that better mapping is established for identifying relations more accurately. Specifically, we propose an entity graph based on Graph Convolutional Networks (GCN) with an attention mechanism, which can support comprehensive relationship reasoning. Third, we develop a Community Detection based on the Ensemble Learning model (CDEL), which leverages a heterogeneous skip-gram model to perform node embedding and detect communities. Experiments on SRIV datasets and four movie videos validate that our solution outperforms several competitive baselines.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2025 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.