TransCrowd: Weakly-Supervised Crowd Counting with Transformer

Liang, Dingkang; Chen, Xiwu; Xu, Wei; Zhou, Yu; Bai, Xiang

doi:10.48550/arxiv.2104.09116

Cited by 14 publications

(18 citation statements)

References 58 publications


“…For vision tasks, images or features are first converted into sequences of vectors, the global interactions within which are then modelled by the transformers. Since the pioneer works such as ViT [12] and DETR [14], it has been shown to be effective in various tasks, including image classification [13], [31], [32], object detection [14], semantic/instance segmentation [15], [33], video segmentation [34], crowd counting [16], [35], depth estimation [36], [37], domain adaptation [38], [39], and virtual try-on [40]. In particular, ViT [12] divides the image into patches and converts them to sequences of features, which are then used as the input to the transformers.…”

Section: Transformer (mentioning)

confidence: 99%

Boosting Few-shot Semantic Segmentation with Transformers

Sun, Liu, Liang et al. 2021. Preprint.

Due to the fact that fully supervised semantic segmentation methods require sufficient fully-labeled data to work well and cannot generalize to unseen classes, few-shot segmentation has attracted lots of research attention. Previous arts extract features from support and query images, which are processed jointly before making predictions on query images. The whole process is based on convolutional neural networks (CNN), leading to the problem that only local information is used. In this paper, we propose a TRansformer-based Few-shot Semantic segmentation method (TRFS). Specifically, our model consists of two modules: Global Enhancement Module (GEM) and Local Enhancement Module (LEM). GEM adopts transformer blocks to exploit global information, while LEM utilizes conventional convolutions to exploit local information, across query and support features. Both GEM and LEM are complementary, helping to learn better feature representations for segmenting query images. Extensive experiments on PASCAL-5^i and COCO datasets show that our approach achieves new state-of-the-art performance, demonstrating its effectiveness. Code and pretrained models will be available at https://github.com/GuoleiSun/TRFS.
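The patch-to-sequence step mentioned in the citation statement above (ViT "divides the image into patches and converts them to sequences of features") can be illustrated with a minimal sketch. This is not code from any of the cited papers: the function name `image_to_patch_sequence`, the grayscale nested-list image format, and the row-major patch order are assumptions made for illustration, and the learned linear projection that a real ViT applies to each patch is omitted.

```python
from typing import List

# Assumption: a grayscale image stored as a list of rows of pixel values.
Image = List[List[float]]


def image_to_patch_sequence(image: Image, patch_size: int) -> List[List[float]]:
    """Split an H x W image into non-overlapping square patches,
    flattening each patch into one vector of the output sequence."""
    height = len(image)
    width = len(image[0]) if height else 0
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")

    sequence = []
    # Visit the patch grid in row-major order (an illustrative choice).
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            sequence.append(patch)
    return sequence


# A 4x4 toy image whose pixel value encodes its position.
toy_image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
patches = image_to_patch_sequence(toy_image, patch_size=2)
```

With a 4x4 image and 2x2 patches, `patches` holds four vectors of four values each; a transformer would then treat these vectors (after projection and position encoding) as its input sequence.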


“…For the regression experiments, we investigate crowd counting, which is the problem of counting the total number of people present in a given image [15]. We use DISCO [16] as the dataset and TransCrowd [17] which is a ViT-based architecture as the backbone. Mean absolute error (MAE) is commonly used to evaluate the accuracy of crowd counting models [18].…”

Section: Results (mentioning)

confidence: 99%

Multi-Exit Vision Transformer for Dynamic Inference

Bakhtiarnia, Zhang, Iosifidis 2021. Preprint.

Deep neural networks can be converted to multi-exit architectures by inserting early exit branches after some of their intermediate layers. This allows their inference process to become dynamic, which is useful for time-critical IoT applications with stringent latency requirements, but with time-variant communication and computation resources. This is particularly relevant in edge computing systems and IoT networks, where the exact computation time budget is variable and not known beforehand. Vision Transformer is a recently proposed architecture which has since found many applications across various domains of computer vision. In this work, we propose seven different architectures for early exit branches that can be used for dynamic inference in Vision Transformer backbones. Through extensive experiments involving both classification and regression problems, we show that each one of our proposed architectures could prove useful in the trade-off between accuracy and speed.
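The citation statement above names mean absolute error (MAE) as the usual accuracy measure for crowd counting. As a small illustration, MAE is the average of the absolute differences between predicted and ground-truth counts over the evaluated images. The function name `mean_absolute_error` and the sample counts below are made up for this sketch and are not taken from the cited work or from any dataset.

```python
from typing import Sequence


def mean_absolute_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Average absolute difference between predicted and true per-image counts."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        raise ValueError("at least one image is required")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Hypothetical per-image crowd counts (illustrative values only).
predicted_counts = [102.0, 48.5, 310.0]
true_counts = [100, 50, 300]
mae = mean_absolute_error(predicted_counts, true_counts)
```

Here the per-image errors are 2, 1.5 and 10, so `mae` comes out to 4.5 people per image.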


“…Recently, natural language processing model Transformer [76] has gained much popularity in the computer vision community. When used in vision problems such as image classification [66,19,84,56,45,55,75], object detection [6,53,74,56], segmentation [84,99,56,4] and crowd counting [47,69], it learns to attend to important image regions by exploring the global interactions between different regions. Due to its impressive performance, Transformer has also been introduced for image restoration [9,5,82].…”

Section: Vision Transformer (mentioning)

confidence: 99%

SwinIR: Image Restoration Using Swin Transformer

Liang, Cao, Sun et al. 2021. Preprint.

Image restoration is a long-standing low-level vision problem that aims to restore high-quality images from low-quality images (e.g., downscaled, noisy and compressed images). While state-of-the-art image restoration methods are based on convolutional neural networks, few attempts have been made with Transformers which show impressive performance on high-level vision tasks. In this paper, we propose a strong baseline model SwinIR for image restoration based on the Swin Transformer. SwinIR consists of three parts: shallow feature extraction, deep feature extraction and high-quality image reconstruction. In particular, the deep feature extraction module is composed of several residual Swin Transformer blocks (RSTB), each of which has several Swin Transformer layers together with a residual connection. We conduct experiments on three representative tasks: image super-resolution (including classical, lightweight and real-world image super-resolution), image denoising (including grayscale and color image denoising) and JPEG compression artifact reduction. Experimental results demonstrate that SwinIR outperforms state-of-the-art methods on different tasks by up to 0.14∼0.45 dB, while the total number of parameters can be reduced by up to 67%.
