Generating Images with Sparse Representations

Nash, Charlie; Menick, Jacob; Dieleman, Sander; Battaglia, Peter W.

doi:10.48550/arxiv.2103.03841

Cited by 16 publications

(22 citation statements)

References 0 publications


“…The DRConv proposes to dynamically select the CNN filters whereas there is still local context exploited, while the DynamicViT dynamically sparsifies tokens, which may underperform on dense prediction tasks due to the attenuation of fine-grained local interactions. The DCTransformer [18] transits the view of solving the problem into frequency domain and demonstrates the sparse representations can carry sufficient information for generating images. Similarly, the work [39] also converts the input image into frequency domain for visual understanding.…”

Section: Redundancy Reduction Methods (mentioning)

confidence: 99%

NomMer: Nominate Synergistic Context in Vision Transformer for Visual Recognition

Líu, Jiang, Li et al. 2021

Preprint


Recently, Vision Transformers (ViT), with the self-attention (SA) as the de facto ingredients, have demonstrated great potential in the computer vision community. For the sake of trade-off between efficiency and performance, a group of works merely perform SA operation within local patches, whereas the global contextual information is abandoned, which would be indispensable for visual recognition tasks. To solve the issue, the subsequent global-local ViTs take a stab at marrying local SA with global one in parallel or alternative way in the model. Nevertheless, the exhaustively combined local and global context may exist redundancy for various visual data, and the receptive field within each layer is fixed. Alternatively, a more graceful way is that global and local context can adaptively contribute per se to accommodate different visual data. To achieve this goal, we in this paper propose a novel ViT architecture, termed NomMer, which can dynamically Nominate the synergistic global-local context in vision transforMer. By investigating the working pattern of our proposed NomMer, we further explore what context information is focused. Beneficial from this "dynamic nomination" mechanism, without bells and whistles, the NomMer can not only achieve 84.5% Top-1 classification accuracy on ImageNet with only 73M parameters, but also show promising performance on dense prediction tasks, i.e., object detection and semantic segmentation. The code and models will be made publicly available at https://github.com/NomMer1125/NomMer.


“…Further, recent works have shown promise for storing compressed datasets as functions (Dupont et al, 2021a;Chen et al, 2021;Strümpler et al, 2021;Zhang et al, 2021). Using our framework, it may therefore become possible to train deep learning models directly on these compressed datasets, which is challenging for traditional compressed formats such as JPEG (although image-specific exceptions such as Nash et al (2021) exist). In addition, learning distributions of functa is likely to improve entropy coding and hence compression for these frameworks (Ballé et al, 2016).…”

Section: Conclusion Limitations and Future Work (mentioning)

confidence: 99%

From data to functa: Your data point is a function and you can treat it like one

Dupont, Kim, Eslami et al. 2022

Preprint


It is common practice in deep learning to represent a measurement of the world on a discrete grid, e.g. a 2D grid of pixels. However, the underlying signal represented by these measurements is often continuous, e.g. the scene depicted in an image. A powerful continuous alternative is then to represent these measurements using an implicit neural representation, a neural function trained to output the appropriate measurement value for any input spatial location. In this paper, we take this idea to its next level: what would it take to perform deep learning on these functions instead, treating them as data? In this context we refer to the data as functa, and propose a framework for deep learning on functa. This view presents a number of challenges around efficient conversion from data to functa, compact representation of functa, and effectively solving downstream tasks on functa. We outline a recipe to overcome these challenges and apply it to a wide range of data modalities including images, 3D shapes, neural radiance fields (NeRF) and data on manifolds. We demonstrate that this approach has various compelling properties across data modalities, in particular on the canonical tasks of generative modeling, data imputation, novel view synthesis and classification.


“…1 we evaluate our approach against a variety of other models in terms of Precision, Recall, Density, and Coverage (PRDC) [44,50,63], metrics that quantify the overlap between the data and sample distributions. Due to limited computing resources, we are unable to provide density and coverage scores for DCT [51] and PRDC scores for StyleGAN2 on LSUN Bedroom since training on a standard GPU would take more than 30 days per experiment, significantly more than the 10 days required to train our models. On the LSUN datasets our approach achieves the highest Precision, Density, and Coverage; indicating that the data and sample manifolds have the most overlap.…”

Section: Sample Quality (mentioning)

confidence: 99%

“…In this work, we compare approaches using Precision and Recall [63] approaches which, unlike FID, evaluate sample quality and diversity separately and have been used in similar recent work assessing high-resolution image generation [30,37,51,59]. Precision is the expected likelihood of fake samples lying on the data manifold and recall vice versa.…”

Section: Limitations Of FID Metric (mentioning)

confidence: 99%

“…To compute these measures we use the official code releases and pretrained weights in all cases except Taming Transformers on the LSUN datasets where weights were not available; in this case we reproduced results as close as possible with the hardware available, training the VQGANs and autoregressive models with the same hyperparameters used for the rest of our experiments. Following Nash et al [51] we use the standard 2048D InceptionV3 features, which are also used to compute FID. The measures are computed using the code provided by Naeem et al [50].…”

Section: Supplementary Materials (mentioning)

confidence: 99%


Unleashing Transformers: Parallel Token Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes

Bond-Taylor, Hessey, Sasaki et al. 2021

Preprint


Whilst diffusion probabilistic models can generate high quality image content, key limitations remain in terms of both generating high-resolution imagery and their associated high computational requirements. Recent Vector-Quantized image models have overcome this limitation of image resolution but are prohibitively slow and unidirectional as they generate tokens via element-wise autoregressive sampling from the prior. By contrast, in this paper we propose a novel discrete diffusion probabilistic model prior which enables parallel prediction of Vector-Quantized tokens by using an unconstrained Transformer architecture as the backbone. During training, tokens are randomly masked in an order-agnostic manner and the Transformer learns to predict the original tokens. This parallelism of Vector-Quantized token prediction in turn facilitates unconditional generation of globally consistent high-resolution and diverse imagery at a fraction of the computational expense. In this manner, we can generate image resolutions exceeding that of the original training set samples whilst additionally provisioning per-image likelihood estimates (in a departure from generative adversarial approaches). Our approach achieves state-of-the-art results in terms of Density (LSUN
