2021
DOI: 10.48550/arXiv.2105.09999
Preprint

Convolutional Block Design for Learned Fractional Downsampling

Abstract: The layers of convolutional neural networks (CNNs) can be used to alter the resolution of their inputs, but the scaling factors are limited to integer values. However, in many image and video processing applications, the ability to resize by a fractional factor would be advantageous. One example is conversion between resolutions standardized for video compression, such as from 1080p to 720p. To solve this problem, we propose an alternative building block, formulated as a conventional convolutional layer follow…
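The abstract truncates before fully describing the block, but the stated structure is a conventional convolution followed by a fractional resizing step. A minimal, hypothetical PyTorch sketch of such a block is shown below; the class and parameter names are assumptions, not the authors' code, and bilinear interpolation stands in for whatever differentiable resizer the paper actually uses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FractionalDownsampleBlock(nn.Module):
    """Hypothetical sketch: a conventional stride-1 convolution followed
    by a differentiable resizer, so the network can learn features while
    resizing by a non-integer factor (e.g., 2/3 for 1080p -> 720p)."""

    def __init__(self, in_ch: int, out_ch: int, scale: float = 2 / 3,
                 kernel_size: int = 3):
        super().__init__()
        # Conv strides only support integer factors, so the convolution
        # preserves resolution and the resizer handles fractional scaling.
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size,
                              padding=kernel_size // 2)
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.conv(x)
        # Compute the target size explicitly (round, not floor) to avoid
        # floating-point off-by-one issues with fractional scale factors.
        h, w = x.shape[-2:]
        out_hw = (round(h * self.scale), round(w * self.scale))
        # Bilinear interpolation is differentiable, so gradients flow
        # through the resize during end-to-end training.
        return F.interpolate(x, size=out_hw, mode="bilinear",
                             align_corners=False)

# Usage: a 1080p input downsampled to 720p (scale factor 2/3).
block = FractionalDownsampleBlock(3, 16)
y = block(torch.randn(1, 3, 1080, 1920))
print(y.shape)  # torch.Size([1, 16, 720, 1280])
```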

Cited by 1 publication (1 citation statement) · References 37 publications
“…It is unlikely that this would significantly affect ResViT's sensitivity to local features since the primary component of ART that captures local features is the residual CNN module whose resolution can be preserved. If the input image does not have a power-of-two size, the abovementioned strategies can be adopted after zero-padding to round up the resolution to the nearest power of two, or by implementing the encoder with non-integer downsampling rates [107]. Note that computer vision studies routinely fine-tune transformers at different image resolutions than encountered during pre-training without performance loss [49], so ResViT might also demonstrate similar behavior.…”
Section: Discussion
confidence: 99%
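As a concrete illustration of the zero-padding workaround described in this excerpt, the sketch below pads an image tensor so its spatial dimensions round up to the nearest power of two. The helper names are hypothetical and PyTorch is assumed; this is not code from either cited paper.

```python
import torch
import torch.nn.functional as F

def next_power_of_two(n: int) -> int:
    # Smallest power of two >= n, e.g., 720 -> 1024; 512 stays 512.
    return 1 << (n - 1).bit_length()

def pad_to_power_of_two(x: torch.Tensor) -> torch.Tensor:
    """Zero-pad an (N, C, H, W) tensor so H and W become powers of two,
    enabling encoders built from integer (power-of-two) downsampling
    stages to handle arbitrary input resolutions."""
    h, w = x.shape[-2:]
    pad_h = next_power_of_two(h) - h
    pad_w = next_power_of_two(w) - w
    # F.pad takes 2D padding as (left, right, top, bottom).
    return F.pad(x, (0, pad_w, 0, pad_h))

x = torch.randn(1, 1, 720, 1280)
print(pad_to_power_of_two(x).shape)  # torch.Size([1, 1, 1024, 2048])
```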