2022
DOI: 10.3390/s22218583

Attention-Guided Disentangled Feature Aggregation for Video Object Detection

Abstract: Object detection is a computer vision task that involves localisation and classification of objects in an image. Video data introduces several additional challenges, such as blur, occlusion, and defocus, making video object detection more difficult than still-image object detection, which operates on individual, independent images. This paper tackles these challenges by proposing an attention-heavy framework for video object detection that aggregates the disentangled features extracted from…
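The abstract describes attention-guided aggregation of features from neighbouring frames onto a key frame. As a rough illustration of that general idea only (not the paper's actual architecture; shapes, names, and the plain scaled dot-product formulation are assumptions), a minimal aggregation step might look like:

```python
import torch

def aggregate_frame_features(key_feat, support_feats):
    """Attention-weighted aggregation of support-frame features onto a key frame.

    key_feat:      (C, H, W) features of the frame being detected on.
    support_feats: (T, C, H, W) features of T neighbouring frames.
    Returns an aggregated (C, H, W) feature map.
    """
    C = key_feat.shape[0]
    key = key_feat.flatten(1).T                                     # (H*W, C)
    sup = support_feats.flatten(2).permute(0, 2, 1).reshape(-1, C)  # (T*H*W, C)

    # Each key-frame location attends over all support-frame locations.
    attn = torch.softmax(key @ sup.T / C ** 0.5, dim=-1)            # (H*W, T*H*W)
    agg = attn @ sup                                                # (H*W, C)
    return agg.T.reshape(key_feat.shape)

# Toy usage with random backbone features.
key = torch.randn(256, 14, 14)
support = torch.randn(4, 256, 14, 14)
print(aggregate_frame_features(key, support).shape)  # torch.Size([256, 14, 14])
```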

Cited by 4 publications (4 citation statements) | References 61 publications
“…Disentangled representation learning is a well-studied topic that aims to learn representations of the various independent components hidden behind data. It has been applied to computer vision [26], recommendation [27], natural language processing [28], and other domains. Hamaguchi et al. [29], for example, used the disentanglement technique to detect rare events, proposing a method that learns disentangled representations from low-cost negative samples: each pair of observations is disentangled into variant and invariant factors, representing, respectively, mixed information related to trivial events and image content invariant to trivial events.…”
Section: Related Work
confidence: 99%
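The variant/invariant split attributed to Hamaguchi et al. [29] above can be sketched as a pair of projection heads plus an invariance loss on negative (no-event) pairs. This is a toy illustration under assumed shapes and names, not the authors' implementation:

```python
import torch
import torch.nn as nn

class PairDisentangler(nn.Module):
    """Toy encoder that splits each observation into an 'invariant' code
    (shared image content) and a 'variant' code (trivial, event-like changes)."""

    def __init__(self, in_dim=512, inv_dim=128, var_dim=128):
        super().__init__()
        self.inv_head = nn.Linear(in_dim, inv_dim)  # invariant factor
        self.var_head = nn.Linear(in_dim, var_dim)  # variant factor

    def forward(self, x):
        return self.inv_head(x), self.var_head(x)

def invariance_loss(model, x_a, x_b):
    """For a negative (no-event) pair, pull the invariant codes together;
    a full method would add reconstruction terms using both codes."""
    inv_a, _ = model(x_a)
    inv_b, _ = model(x_b)
    return torch.mean((inv_a - inv_b) ** 2)

model = PairDisentangler()
x_a, x_b = torch.randn(8, 512), torch.randn(8, 512)
print(invariance_loss(model, x_a, x_b).item())
```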
“…In contrast, an effective multiscale scheme enables the model to remain invariant as the scale changes and to capture more intrinsic patterns [39]. The paper [40] uses k-order polynomials of adjacency matrices to aggregate multiscale structural information, learning rich representations by establishing relationships between distant and nearby neighbors; similar ideas appear in [26, 27, 29].…”
Section: The Proposed Model
confidence: 99%
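The k-order polynomial aggregation attributed to [40] amounts to mixing node features through successive powers of a normalized adjacency matrix, so that the k-th term reaches k-hop neighbours. A minimal NumPy sketch (the coefficients and the symmetric normalization are illustrative assumptions):

```python
import numpy as np

def polynomial_aggregate(A, X, K=3, coeffs=None):
    """Multiscale aggregation: sum_k c_k * A_norm^k @ X, where the k-th
    term mixes information from k-hop neighbourhoods.

    A: (N, N) adjacency matrix, X: (N, d) node features.
    """
    if coeffs is None:
        coeffs = np.ones(K + 1) / (K + 1)
    # Symmetric normalization D^{-1/2} (A + I) D^{-1/2}.
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    A_norm = A_hat / np.sqrt(np.outer(d, d))

    out = np.zeros_like(X, dtype=float)
    P = np.eye(A.shape[0])          # A_norm^0
    for k in range(K + 1):
        out += coeffs[k] * (P @ X)
        P = P @ A_norm              # next power of the adjacency
    return out

# Path graph 0-1-2-3 with 2-d node features.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
X = np.random.randn(4, 2)
print(polynomial_aggregate(A, X).shape)  # (4, 2)
```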
“…Among them, the decoupled head outperforms the coupled head in most cases. Muralidhara et al. [27] raised the performance ceiling of current video object detection methods by introducing DyHead detection heads into Faster R-CNN, combining scale-, spatial-, and task-aware attention, and achieved good results.…”
Section: Introduction
confidence: 99%
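DyHead's scale-, spatial-, and task-aware attentions can be approximated as three sequential gatings of a flattened feature-pyramid tensor. The sketch below is a heavily simplified stand-in (the real DyHead uses deformable convolution and dynamic ReLU, which are omitted here; all module names are assumptions):

```python
import torch
import torch.nn as nn

class SimpleDynamicHead(nn.Module):
    """Simplified DyHead-style block on a (B, L, S, C) pyramid tensor:
    scale-aware, then spatial-aware, then task-aware attention."""

    def __init__(self, C):
        super().__init__()
        self.scale_fc = nn.Linear(C, 1)    # one weight per pyramid level
        self.spatial_fc = nn.Linear(C, 1)  # one weight per location
        self.task_fc = nn.Linear(C, C)     # channel (task) gating

    def forward(self, feat):               # feat: (B, L, S, C)
        # Scale attention: weight each level by its pooled descriptor.
        w_l = torch.sigmoid(self.scale_fc(feat.mean(dim=2)))      # (B, L, 1)
        feat = feat * w_l.unsqueeze(2)
        # Spatial attention: weight each location.
        w_s = torch.sigmoid(self.spatial_fc(feat))                # (B, L, S, 1)
        feat = feat * w_s
        # Task attention: gate the channels.
        w_c = torch.sigmoid(self.task_fc(feat.mean(dim=(1, 2))))  # (B, C)
        return feat * w_c[:, None, None, :]

head = SimpleDynamicHead(C=256)
feats = torch.randn(2, 5, 49, 256)  # 5 levels, 7x7 flattened space
print(head(feats).shape)            # torch.Size([2, 5, 49, 256])
```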
“…Video salient object detection simulates the human visual perception system to intelligently detect salient targets in video images at the semantic level, ultimately enabling autonomous analysis and understanding of video content [5–11]. Traditional video object detection is typically used to distinguish broad target categories; when image content is complex and diverse, it cannot capture enough visual cues, making it difficult to distinguish small differences between categories [12–22]. This problem cannot be solved by relying on manual image annotations to indicate which regions the detection model should attend to and which target features it should extract.…”
Section: Introduction
confidence: 99%