Recently, the Vision Transformer and its variants have shown great promise on various computer vision tasks. The ability to capture short- and long-range visual dependencies through self-attention is the key to its success. However, it also brings challenges due to quadratic computational overhead, especially for high-resolution vision tasks (e.g., object detection). Many recent works have attempted to reduce the computational and memory cost and improve performance by applying either coarse-grained global attention or fine-grained local attention. However, both approaches cripple the modeling power of the original self-attention mechanism of multi-layer Transformers, thus leading to sub-optimal solutions. In this paper, we present focal self-attention, a new mechanism that incorporates both fine-grained local and coarse-grained global interactions. In this new mechanism, each token attends to its closest surrounding tokens at fine granularity and to tokens far away at coarse granularity, and can thus capture both short- and long-range visual dependencies efficiently and effectively. With focal self-attention, we propose a new variant of Vision Transformer models, called Focal Transformer, which achieves superior performance over the state-of-the-art (SoTA) vision Transformers on a range of public image classification and object detection benchmarks. In particular, our Focal Transformer models with a moderate size of 51.1M parameters and a larger size of 89.8M parameters achieve 83.5% and 83.8% Top-1 accuracy, respectively, on ImageNet classification at 224 × 224 resolution. When employed as backbones, Focal Transformers achieve consistent and substantial improvements over the current SoTA Swin Transformers [44] across 6 different object detection methods. Our largest Focal Transformer yields 58.7/58.9 box mAPs and 50.9/51.3 mask mAPs on COCO mini-val/test-dev, and 55.4 mIoU on ADE20K for semantic segmentation, creating new SoTA results on three of the most challenging computer vision tasks.
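
To make the fine-plus-coarse attention idea concrete, the sketch below shows one way each query could attend jointly to all fine-grained tokens and to a pooled, coarse-grained summary of the feature map. This is a minimal illustration, not the paper's implementation: the class name `FocalAttentionSketch`, the `pool_size` parameter, and the use of a single pooling level are assumptions for exposition, and the window partitioning, multi-level focal regions, and relative position bias described in the paper are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FocalAttentionSketch(nn.Module):
    """Illustrative sketch: queries attend to fine-grained tokens plus
    coarse-grained (average-pooled) summaries of the whole feature map."""

    def __init__(self, dim, num_heads=8, pool_size=4):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.pool_size = pool_size          # coarse grid resolution (assumed)
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, H, W, C) feature map
        B, H, W, C = x.shape
        qkv = self.qkv(x).reshape(B, H * W, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)          # each: (B, heads, HW, hd)

        # Coarse-grained keys/values: pool the map to pool_size x pool_size.
        x_coarse = F.adaptive_avg_pool2d(x.permute(0, 3, 1, 2), self.pool_size)
        x_coarse = x_coarse.flatten(2).transpose(1, 2)        # (B, P*P, C)
        kv_c = self.qkv(x_coarse)[..., C:].reshape(
            B, -1, 2, self.num_heads, self.head_dim)
        k_c, v_c = kv_c.permute(2, 0, 3, 1, 4)        # each: (B, heads, P*P, hd)

        # For brevity, every query attends to all fine tokens here; the paper
        # instead restricts fine attention to each query's local window,
        # which is where the efficiency gain comes from.
        k_all = torch.cat([k, k_c], dim=2)
        v_all = torch.cat([v, v_c], dim=2)

        attn = (q @ k_all.transpose(-2, -1)) * self.scale
        out = attn.softmax(dim=-1) @ v_all            # (B, heads, HW, hd)
        out = out.transpose(1, 2).reshape(B, H, W, C)
        return self.proj(out)


# Example usage on a small feature map.
layer = FocalAttentionSketch(dim=96, num_heads=4)
y = layer(torch.randn(2, 14, 14, 96))                # -> (2, 14, 14, 96)
```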