2021
DOI: 10.48550/arxiv.2104.09116
Preprint

TransCrowd: Weakly-Supervised Crowd Counting with Transformer

Abstract: Mainstream crowd counting methods usually use a convolutional neural network (CNN) to regress a density map, which requires point-level annotations. However, annotating each person with a point is an expensive and laborious process. During the testing phase, the point-level annotations are not used to evaluate counting accuracy, which means they are redundant. Hence, it is desirable to develop weakly-supervised counting methods that rely only on count-level annotations, a mo…

Cited by 14 publications (18 citation statements)
References 58 publications
“…For vision tasks, images or features are first converted into sequences of vectors, the global interactions within which are then modelled by the transformers. Since the pioneer works such as ViT [12] and DETR [14], it has been shown to be effective in various tasks, including image classification [13], [31], [32], object detection [14], semantic/instance segmentation [15], [33], video segmentation [34], crowd counting [16], [35], depth estimation [36], [37], domain adaptation [38], [39], and virtual try-on [40]. In particular, ViT [12] divides the image into patches and converts them to sequences of features, which are then used as the input to the transformers.…”
Section: Transformer (mentioning)
confidence: 99%
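The patch-splitting step described in the statement above can be illustrated with a minimal sketch. It assumes a single-channel image stored as a list of rows and a hypothetical helper name, `image_to_patch_sequence`; it is not taken from ViT or TransCrowd, and it omits the learned linear projection that a real model applies to each flattened patch.

```python
def image_to_patch_sequence(image, patch_size):
    """Split a 2D image (list of rows) into non-overlapping square patches.

    Returns a list of flattened patches in row-major order. This helper is
    hypothetical and only illustrates the idea of turning an image into a
    sequence of vectors.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")

    sequence = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size block into a single vector.
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            sequence.append(patch)
    return sequence


# A 4x4 image whose pixel value is 4 * row + col, split into 2x2 patches.
image = [[4 * row + col for col in range(4)] for row in range(4)]
sequence = image_to_patch_sequence(image, 2)
```

Each entry of `sequence` is one flattened patch; a transformer would take these vectors, after projection and the addition of position information, as its input tokens.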
“…For the regression experiments, we investigate crowd counting, which is the problem of counting the total number of people present in a given image [15]. We use DISCO [16] as the dataset and TransCrowd [17] which is a ViT-based architecture as the backbone. Mean absolute error (MAE) is commonly used to evaluate the accuracy of crowd counting models [18].…”
Section: Results (mentioning)
confidence: 99%
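As a small illustration of the metric named in the statement above, mean absolute error over a set of images is the average of the absolute differences between predicted and ground-truth counts. The function name and the sample counts below are made up for illustration and are not results from any of the cited papers.

```python
def mean_absolute_error(predicted_counts, true_counts):
    """Average absolute difference between predicted and true crowd counts."""
    if len(predicted_counts) != len(true_counts):
        raise ValueError("both sequences must have the same length")
    if not true_counts:
        raise ValueError("at least one image is required")
    total_error = sum(
        abs(predicted - actual)
        for predicted, actual in zip(predicted_counts, true_counts)
    )
    return total_error / len(true_counts)


# Illustrative counts for three images (absolute errors: 2, 2 and 3).
mae = mean_absolute_error([10, 20, 30], [12, 18, 33])
```

Because only the per-image count enters the metric, evaluation of this kind does not need the point-level annotations mentioned in the abstract.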
“…Recently, natural language processing model Transformer [76] has gained much popularity in the computer vision community. When used in vision problems such as image classification [66,19,84,56,45,55,75], object detection [6,53,74,56], segmentation [84,99,56,4] and crowd counting [47,69], it learns to attend to important image regions by exploring the global interactions between different regions. Due to its impressive performance, Transformer has also been introduced for image restoration [9,5,82].…”
Section: Vision Transformer (mentioning)
confidence: 99%