2021
DOI: 10.48550/arxiv.2108.05015
Preprint

VisEvent: Reliable Object Tracking via Collaboration of Frame and Event Flows

Abstract: Unlike visible cameras, which record intensity images frame by frame, the biologically inspired event camera produces a stream of asynchronous and sparse events with much lower latency. In practice, visible cameras can better perceive texture details and slow motion, while event cameras are free from motion blur and have a larger dynamic range, which enables them to work well under fast motion and low illumination. Therefore, the two sensors can cooperate with each other to achieve more reliable …
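As a minimal, hedged illustration of the event stream the abstract contrasts with intensity frames: events arrive as sparse (x, y, timestamp, polarity) tuples rather than dense images, and one simple way to pair them with an RGB frame is to accumulate the events from the frame's exposure window into a 2-D event frame. The NumPy sketch below shows that idea only; the (N, 4) array layout and the accumulate_events helper are assumptions for illustration, not part of the VisEvent pipeline.

```python
import numpy as np

# Minimal sketch (not the VisEvent pipeline): an event camera emits a stream
# of (x, y, t, polarity) tuples instead of dense frames.  A simple way to
# combine events with an RGB frame is to accumulate the events that fall
# inside the frame's exposure window into a 2-D "event frame".

def accumulate_events(events, height, width):
    """Accumulate polarity-signed events into a single event frame.

    events: array of shape (N, 4) with columns (x, y, t, polarity),
            polarity in {-1, +1}.  The column order is an assumption.
    """
    frame = np.zeros((height, width), dtype=np.float32)
    for x, y, _t, p in events:
        frame[int(y), int(x)] += p  # signed accumulation per pixel
    return frame

# Toy usage: a few synthetic events on a 4x6 sensor.
toy_events = np.array([
    [1, 2, 0.001, +1],
    [1, 2, 0.002, +1],
    [5, 0, 0.003, -1],
])
event_frame = accumulate_events(toy_events, height=4, width=6)
print(event_frame)
```

Such an event frame can then be processed alongside the RGB image by a conventional tracker, which is the kind of frame–event collaboration the abstract motivates.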

Cited by 12 publications (27 citation statements). References 71 publications (99 reference statements).
“…The first group mainly comprises event-based datasets for object recognition (e.g., DVS-Gesture [114], N-CARS [145], and ALS-DVS [113]) and action recognition (e.g., DVS-PAF [146] and DHP19 [147]). In the second group, neuromorphic datasets for regression tasks include image reconstruction (e.g., CED [148] and BS-ERGB [149]), object detection (e.g., PKU-DDD17-CAR [150], 1Mpx Automotive Detection [151], and PKU-DAVIS-SOD [328]), object tracking (e.g., FED240hz [152] and VisEvent [153]), depth estimation (e.g., MVSEC [154] and DESC [155]), and SLAM (e.g., UZH-FPV [156] and VECtor [157]), etc. As shown in Table VII, most of the existing real-world datasets are for object recognition tasks, and few for complex regression tasks, especially scene segmentation datasets with pixel-level annotation.…”
Section: B. Categorization (mentioning, confidence: 99%)
“…For example, the widely used RGB-based tracking datasets, GOT-10k [13], TrackingNet [36], and LaSOT [9], contain 9.3K, 30.1K, and 1.1K sequences, corresponding to 1.4M, 14M, and 2.8M frames for training. Whereas the largest training datasets in multi-modal tracking, DepthTrack [47], LasHeR [25], VisEvent [43], contain 150, 979, 500 training sequences, corresponding to 0.22M, 0.51M, 0.21M annotated frame pairs, which is at least an order of magnitude less than the former. Accounting for the above limitation, multi-modal tracking methods [43,47,61] usually utilize pre-trained RGB-based trackers and perform fine-tuning on their task-oriented training sets (as shown in Figure 1 (a)→(b)).…”
Section: Introduction (mentioning, confidence: 99%)
“…Whereas the largest training datasets in multi-modal tracking, DepthTrack [47], LasHeR [25], VisEvent [43], contain 150, 979, 500 training sequences, corresponding to 0.22M, 0.51M, 0.21M annotated frame pairs, which is at least an order of magnitude less than the former. Accounting for the above limitation, multi-modal tracking methods [43,47,61] usually utilize pre-trained RGB-based trackers and perform fine-tuning on their task-oriented training sets (as shown in Figure 1 (a)→(b)). DeT [47] adds a depth feature extraction branch to the original ATOM [7] or DiMP [3] tracker and fine-tunes on RGB-D training data.…”
Section: Introduction (mentioning, confidence: 99%)
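The last two statements describe a common recipe: start from a tracker pre-trained on large RGB datasets, attach a second feature branch for the auxiliary modality (events or depth), fuse the two feature streams, and fine-tune on the much smaller multi-modal training set. The PyTorch sketch below illustrates that recipe under stated assumptions; the module names, the concatenation-based fusion, and the frozen RGB backbone are illustrative choices, not the actual DeT, ATOM, or DiMP implementation.

```python
import torch
import torch.nn as nn

# Hedged sketch of the fine-tuning recipe described above, NOT the actual
# DeT / ATOM / DiMP code: keep a pre-trained RGB feature extractor, add a
# parallel branch for the auxiliary modality (event frames or depth maps),
# fuse, and train only the new parts on the small multi-modal dataset.

class TwoStreamTrackerHead(nn.Module):
    def __init__(self, rgb_backbone: nn.Module, feat_dim: int = 256):
        super().__init__()
        self.rgb_backbone = rgb_backbone          # pre-trained on RGB tracking data
        for p in self.rgb_backbone.parameters():  # assumption: freeze the RGB branch
            p.requires_grad = False
        # New branch for the auxiliary modality (1-channel event/depth map).
        self.aux_branch = nn.Sequential(
            nn.Conv2d(1, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # Simple fusion by concatenation + 1x1 conv (an illustrative choice).
        self.fuse = nn.Conv2d(2 * feat_dim, feat_dim, kernel_size=1)

    def forward(self, rgb, aux):
        f_rgb = self.rgb_backbone(rgb)
        f_aux = self.aux_branch(aux)
        return self.fuse(torch.cat([f_rgb, f_aux], dim=1))

# Toy usage with a stand-in RGB backbone; in practice this would be the
# pre-trained tracker's feature extractor.
rgb_backbone = nn.Conv2d(3, 256, kernel_size=3, padding=1)
model = TwoStreamTrackerHead(rgb_backbone)
fused = model(torch.randn(1, 3, 64, 64), torch.randn(1, 1, 64, 64))
print(fused.shape)  # torch.Size([1, 256, 64, 64])
```

Freezing the pre-trained branch and training only the new modules is one way such methods compensate for the order-of-magnitude gap in annotated multi-modal data noted in the quoted statements.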