Temporal-Channel Transformer for 3D Lidar-Based Video Object Detection in Autonomous Driving

Yuan, Zhenxun; Song, Xiaoning; Bai, Lei; Zhou, Wengang; Wang, Zhe; Ouyang, Wanli

doi:10.48550/arxiv.2011.13628

Cited by 1 publication

(1 citation statement)

References 37 publications

“…An emerging thread of work aims at applying transformers to vision tasks such as object detection [5], semantic segmentation [115,99], 3D reconstruction [72], pose estimation [107], generative modeling [14], image retrieval [27], medical image segmentation [13,97,111], point clouds [40], video instance segmentation [103], object re-identification [47], video retrieval [33], video dialogue [64], video object detection [110] and multi-modal tasks [73,23,80,53,108]. A separate line of works attempts at modeling visual data with learnt discretized token sequences [104,83,14,109,18].…”

Section: Related Work

mentioning

confidence: 99%

Multiscale Vision Transformers

Fan, Xiong, Mangalam, et al. 2021

Preprint

121

We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models. Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features with early layers operating at high spatial resolution to model simple low-level visual information, and deeper layers at spatially coarse, but complex, high-dimensional features. We evaluate this fundamental architectural prior for modeling the dense nature of visual signals for a variety of video recognition tasks where it outperforms concurrent vision transformers that rely on large scale external pre-training and are 5-10× more costly in computation and parameters. We further remove the temporal dimension and apply our model for image classification where it outperforms prior work on vision transformers. Code is available at: https://github.com/facebookresearch/SlowFast.
