UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes
Preprint, 2022. DOI: 10.48550/arxiv.2205.10337

Abstract: We introduce UViM, a unified approach capable of modeling a wide range of computer vision tasks. In contrast to previous models, UViM has the same functional form for all tasks; it requires no task-specific modifications which require extensive human expertise. The approach involves two components: (I) a base model (feedforward) which is trained to directly predict raw vision outputs, guided by a learned discrete code and (II) a language model (autoregressive) that is trained to generate the guiding code. Thes…
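
The two-component design described in the abstract can be summarized in a short sketch. The following is a minimal, hypothetical PyTorch illustration: all class names, dimensions, and the greedy decoding loop are assumptions made for clarity, not the paper's actual implementation (UViM uses a VQ-VAE-style restricted oracle for the discrete code and Transformer-based models at much larger scale).

```python
import torch
import torch.nn as nn

# Hypothetical toy dimensions for illustration only.
IMG_DIM, LABEL_DIM, CODE_LEN, VOCAB = 64, 64, 8, 256


class RestrictedOracle(nn.Module):
    """Stage I helper: compresses the ground-truth label into a short
    discrete code, so the code carries only high-level guidance."""

    def __init__(self):
        super().__init__()
        self.encode = nn.Linear(LABEL_DIM, CODE_LEN * VOCAB)

    def forward(self, label):
        logits = self.encode(label).view(-1, CODE_LEN, VOCAB)
        return logits.argmax(-1)  # (batch, CODE_LEN) discrete code


class BaseModel(nn.Module):
    """Feedforward base model: predicts the raw vision output from the
    input image plus the guiding code."""

    def __init__(self):
        super().__init__()
        self.code_emb = nn.Embedding(VOCAB, 16)
        self.head = nn.Linear(IMG_DIM + CODE_LEN * 16, LABEL_DIM)

    def forward(self, image, code):
        c = self.code_emb(code).flatten(1)
        return self.head(torch.cat([image, c], dim=-1))


class GuidingLM(nn.Module):
    """Stage II: autoregressive model that generates the guiding code
    conditioned on the image (greedy decoding shown for brevity)."""

    def __init__(self):
        super().__init__()
        self.step = nn.Linear(IMG_DIM + CODE_LEN, VOCAB)

    def generate(self, image):
        code = torch.zeros(image.size(0), CODE_LEN, dtype=torch.long)
        for t in range(CODE_LEN):
            x = torch.cat([image, code.float()], dim=-1)
            code[:, t] = self.step(x).argmax(-1)
        return code


# Inference: the language model proposes a code, the base model decodes it.
image = torch.randn(2, IMG_DIM)
lm, base = GuidingLM(), BaseModel()
prediction = base(image, lm.generate(image))
print(prediction.shape)  # torch.Size([2, 64])
```

In this reading of the abstract, stage I would train BaseModel with codes produced by the restricted oracle from ground-truth labels, stage II would train the autoregressive model to imitate those codes from the image alone, and only at inference does the generated code replace the oracle's.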

Cited by 4 publications (7 citation statements). References 37 publications.
“…Pix2Seq-D achieves competitive Panoptic Quality (PQ) to state-of-the-art methods with the ResNet-50 backbone. When compared with other recent generalist models such as UViM [31], our model performs significantly better while being much more efficient. Similar results are obtained for Cityscapes, the details of which are given in Appendix C. Table 2 compares Pix2Seq-D to state-of-the-art methods on unsupervised video object segmentation on DAVIS, using the standard J&F metrics [46].…”
Section: Results (mentioning)
Confidence: 90%
“…Eschewing task-specific architectures and loss functions, recent generalist vision models, such as Pix2Seq [10,11], OFA [60], UViM [31], and Unified-IO [43], advocate a generic, task-agnostic framework, generalizing across multiple tasks while being much simpler than previous models. For instance, Pix2Seq [10,11] formulates a set of core vision tasks in terms of the generation of semantically meaningful sequences conditioned on an image, and trains a single autoregressive model based on Transformers [55].…”
Section: Introduction (mentioning)
Confidence: 99%
“…Depth Estimation. On depth estimation, UNIFIED-IO achieves 0.385 RMSE, which is behind the state of the art but ahead of the recently proposed unified model UViM (Kolesnikov et al., 2022), despite being trained to do far more tasks.…”
Section: Results on Additional Tasks (mentioning)
Confidence: 93%
“…Another concurrent work, UViM (Kolesnikov et al., 2022), proposes a unified model for producing visual outputs and applies it to the pixel-labeling tasks of panoptic segmentation, depth prediction, and colorization. Similar to UNIFIED-IO, UViM uses a generative head to predict output tokens that are then used as input to a second model to construct an output image.…”
Section: Introduction (mentioning)
Confidence: 99%
“…For example, unification of different VL understanding tasks can be achieved relatively easily (e.g., SimVLM (Wang et al., 2022k), GIT (Wang et al., 2022d), CoCa (Yu et al., 2022a)), while the unification of VL understanding tasks and region-level localization tasks can be much more challenging (e.g., UniTAB (Yang et al., 2021c), GLIP (Li et al., 2022h), and GLIPv2 (Zhang et al., 2022b)), not to mention the unification of image generation tasks (e.g., OFA (Wang et al., 2022e) and Unified-IO (Lu et al., 2022a)). Pix2SeqV2 (Chen et al., 2022d) and UViM (Kolesnikov et al., 2022) also propose unified approaches for computer vision tasks. MetaLM (Hao et al., 2022) shows that language models can be a general-purpose interface for many diverse tasks.…”
Section: Towards Building General-Purpose Multimodal Foundation Models (mentioning)
Confidence: 99%