State-of-the-art object detection models rely on large-scale datasets to achieve high precision; without sufficient samples, they suffer from severe overfitting. Current work on few-shot object detection falls mainly into meta-learning-based and fine-tuning-based methods. However, existing models do not consider how feature maps should be processed to produce more accurate regions of interest (RoIs), which leads to many RoIs belonging to non-support classes. These non-support RoIs increase the burden on subsequent classification and can even cause misclassification. In addition, catastrophic forgetting is unavoidable in both types of few-shot object detection methods. Many models also classify directly in low-dimensional spaces because of limited resources, and this transformation of the data space can confuse certain categories and lead to misclassification. To address these problems, the Feature Reconstruction Detector (FRDet) is proposed, a simple yet effective fine-tuning-based approach for few-shot object detection. FRDet comprises a region proposal network (RPN) based on channel and spatial attention, called Multi-Attention RPN (MARPN), and a head based on feature reconstruction, called Feature Reconstruction Head (FRHead). Building on Attention RPN, MARPN uses channel attention to suppress non-support classes and spatial attention to enhance support classes, yielding fewer but more accurate RoIs. Meanwhile, FRHead uses support features to reconstruct query RoI features through a closed-form solution, enabling a comprehensive and fine-grained comparison. The model was validated on the PASCAL VOC, MS COCO, FSOD, and CUB200 datasets and achieved improved results compared with existing methods.
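One common closed-form formulation of feature reconstruction of this kind is ridge regression from support features to query RoI features. The sketch below illustrates that reading only; it is not the authors' implementation, and the shapes, the regularization weight `lam`, and the function name are assumptions.

```python
import torch

def reconstruct_query(support, query, lam=0.1):
    """Ridge-regression reconstruction of query RoI features (illustrative).

    support: (k, d) pooled support features for a single class
    query:   (n, d) query RoI features
    Returns the reconstructed features and the per-RoI reconstruction error.
    """
    k = support.size(0)
    # Closed-form solution: W = Q S^T (S S^T + lam * I)^(-1)
    gram = support @ support.t() + lam * torch.eye(k, device=support.device)
    weights = query @ support.t() @ torch.inverse(gram)  # (n, k)
    recon = weights @ support                             # (n, d)
    error = ((query - recon) ** 2).mean(dim=1)            # (n,)
    return recon, error
```

Under this reading, a query RoI would be assigned to the class whose support set reconstructs it with the smallest error.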
Dense video captioning aims to localize multiple events in an untrimmed video and generate a caption for each event. Previous methods had difficulty establishing the multimodal feature relationship between frames and captions, resulting in low accuracy of the generated captions. To address this problem, a novel Dense Video Captioning Model Based on Local Attention (DVCL) is proposed. DVCL employs a 2D temporal differential CNN to extract video features, followed by feature encoding with a deformable transformer that establishes global feature dependencies across the input sequence. DIoU and TIoU are then incorporated into the event proposal matching and evaluation algorithms during training, yielding more accurate event proposals and hence higher-quality captions. Furthermore, an LSTM based on local attention is designed to generate captions, enabling each word in a caption to correspond to the relevant frame. Extensive experimental results demonstrate the effectiveness of DVCL. On the ActivityNet Captions dataset, DVCL performs significantly better than other baselines, improving on the best baseline by 5.6%, 8.2%, and 15.8% in BLEU-4, METEOR, and CIDEr, respectively.
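As a concrete illustration of one matching ingredient, the sketch below (not from the DVCL paper; the segment format and function name are assumptions) computes a 1D Distance-IoU between a temporal event proposal and a ground-truth segment: the usual IoU minus a penalty based on the distance between segment centers, normalized by the smallest enclosing segment.

```python
def temporal_diou(seg_a, seg_b):
    """1D DIoU between two temporal segments given as (start, end) in seconds."""
    s1, e1 = seg_a
    s2, e2 = seg_b
    inter = max(0.0, min(e1, e2) - max(s1, s2))
    union = (e1 - s1) + (e2 - s2) - inter
    iou = inter / union if union > 0 else 0.0
    # Penalty: squared distance between segment centers, normalized by
    # the squared length of the smallest enclosing segment.
    center_dist_sq = ((s1 + e1) / 2.0 - (s2 + e2) / 2.0) ** 2
    enclose = max(e1, e2) - min(s1, s2)
    penalty = center_dist_sq / (enclose ** 2) if enclose > 0 else 0.0
    return iou - penalty

# A proposal shifted relative to the ground truth is penalized beyond its overlap.
print(temporal_diou((10.0, 20.0), (12.0, 22.0)))  # ≈ 0.64
```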