2022
DOI: 10.1609/aaai.v36i1.20007

Joint 3D Object Detection and Tracking Using Spatio-Temporal Representation of Camera Image and LiDAR Point Clouds

Abstract: In this paper, we propose a new joint object detection and tracking (JoDT) framework for 3D object detection and tracking based on camera and LiDAR sensors. The proposed method, referred to as 3D DetecTrack, enables the detector and tracker to cooperate to generate a spatio-temporal representation of the camera and LiDAR data, with which 3D object detection and tracking are then performed. The detector constructs the spatio-temporal features via the weighted temporal aggregation of the spatial features obtaine…
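
The "weighted temporal aggregation of the spatial features" described in the abstract can be pictured with a minimal PyTorch sketch. Everything below (the module name, the sigmoid gating network, the tensor shapes) is an illustrative assumption, not the paper's actual implementation.

import torch
import torch.nn as nn

class WeightedTemporalAggregation(nn.Module):
    # Sketch: blend spatial feature maps from past frames into the current
    # frame's map using a learned per-pixel weight. Illustrative only.
    def __init__(self, channels):
        super().__init__()
        # Hypothetical gating network: predicts a weight in [0, 1] per pixel
        # from the concatenated current and past feature maps.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, curr, past):
        # curr: (B, C, H, W) current-frame spatial features
        # past: list of (B, C, H, W) spatial features from earlier frames
        fused = curr
        for feat in past:
            w = self.gate(torch.cat([fused, feat], dim=1))  # (B, 1, H, W)
            fused = fused + w * feat  # weighted accumulation over time
        return fused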

Cited by 12 publications (3 citation statements). References 36 publications.

“…Intuitively, taking the location information as an example, depth features represent the relative location information while point features represent the absolute location information, and it is important to weigh their contributions to the information fusion accordingly. To achieve this, we adopt an attentive block in (You et al. 2019; Koh et al. 2022), which adaptively fuses the location information between the depth and point features. The fused feature $f_{d(l)}^{k}$ of the location information is obtained through cross-attentive addition, which can be formulated as:…”
Section: Multimodal Point Cloud Compression (mentioning, confidence: 99%)
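
As an illustration of the cross-attentive addition described in the excerpt above, here is a minimal sketch; the sigmoid gate and the single linear layer are assumptions, not the exact attentive block of You et al. 2019 or Koh et al. 2022.

import torch
import torch.nn as nn

class CrossAttentiveAddition(nn.Module):
    # Sketch: a learned gate decides, per element, how much of the depth
    # feature versus the point feature to keep. Hypothetical design.
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)  # hypothetical gating layer

    def forward(self, f_depth, f_point):
        # f_depth: (N, dim) relative-location (depth) features
        # f_point: (N, dim) absolute-location (point) features
        w = torch.sigmoid(self.gate(torch.cat([f_depth, f_point], dim=-1)))
        return w * f_depth + (1.0 - w) * f_point  # cross-attentive blend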
“…They use object detections as object nodes and represent edges as possible trajectory hypotheses. In contrast to trackers that use neighbor-frame information, graph-based approaches define the cross-frame object association problem as a global combinatorial optimization problem (Koh et al. 2022; Dendorfer et al. 2020a; Brasó and Leal-Taixé 2020; Zeng et al. 2022). To this end, many studies have used different optimization strategies, including multi-cuts (Tang et al. 2017), minimal cliques (Zamir, Dehghan, and Shah 2012), network flow (Berclaz et al. 2011; Butt and Collins 2013), and disjoint path approaches (Hornáková et al. 2020, 2021).…”
Section: Graph-based Global Tracking (mentioning, confidence: 99%)
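
The network-flow formulation mentioned in the excerpt can be sketched with networkx. The graph layout, the entry/exit costs, and the helper names (build_tracking_graph, match_cost) are hypothetical; this is a generic min-cost-flow association sketch, not any cited tracker's exact model.

import networkx as nx

def build_tracking_graph(frames, match_cost, entry_cost=5, exit_cost=5):
    # frames: list of lists of detection ids, one inner list per frame.
    # match_cost(a, b) -> int: cost of linking detection a to detection b.
    g = nx.DiGraph()
    n_tracks = min(len(dets) for dets in frames)  # trajectories to extract
    g.add_node("S", demand=-n_tracks)  # source pushes n_tracks units of flow
    g.add_node("T", demand=n_tracks)   # sink absorbs them
    for t, dets in enumerate(frames):
        for d in dets:
            u, v = ("in", t, d), ("out", t, d)
            g.add_edge(u, v, capacity=1, weight=0)             # one track per detection
            g.add_edge("S", u, capacity=1, weight=entry_cost)  # a track may start here
            g.add_edge(v, "T", capacity=1, weight=exit_cost)   # ...or end here
            if t + 1 < len(frames):
                for d2 in frames[t + 1]:                       # links to next frame
                    g.add_edge(v, ("in", t + 1, d2),
                               capacity=1, weight=match_cost(d, d2))
    return g

# Usage: flow = nx.min_cost_flow(build_tracking_graph(frames, cost_fn))
# Edges carrying flow between "out" and "in" nodes define the trajectories.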
“…MPPNet [74] employs RoI grid pooling to extract features from proposals and create trajectories for temporal fusion. MGTANet [75] utilizes an SM-VFE module for encoding features and leverages a motion-guided deformable module to align and merge multi-frame features. D-align [76] designs a dual-query attention network to leverage both target frame features and support frame features.…”
Section: Multi-frame Detection (mentioning, confidence: 99%)
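
To make the idea of fusing target-frame and support-frame features concrete, here is a minimal cross-attention sketch, loosely in the spirit of the dual-query description above; the class name, shapes, and residual design are assumptions, not D-align's actual architecture.

import torch
import torch.nn as nn

class SupportFrameAttention(nn.Module):
    # Sketch: target-frame tokens attend to tokens pooled from earlier
    # (support) frames, and the attended context is added back.
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, target, support):
        # target:  (B, N, dim) tokens from the current frame
        # support: (B, M, dim) tokens from earlier frames
        ctx, _ = self.attn(query=target, key=support, value=support)
        return self.norm(target + ctx)  # residual fusion of temporal context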