2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.01172
MulT: An End-to-End Multitask Learning Transformer

Abstract: We introduce the first multitasking vision transformer adapters that learn generalizable task affinities which can be applied to novel tasks and domains. Integrated into an off-the-shelf vision transformer backbone, our adapters can simultaneously solve multiple dense vision tasks in a parameter-efficient manner, unlike existing multitasking transformers that are parametrically expensive. In contrast to concurrent methods, we do not require retraining or fine-tuning whenever a new task or domain is added. We i…
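The abstract's central claim is parameter efficiency: task-specific adapters are attached to a fixed, off-the-shelf vision transformer backbone instead of duplicating the backbone per task. As a rough illustration only, the Python/PyTorch sketch below shows one common way such per-task bottleneck adapters can be wired around a frozen backbone block; the module names, bottleneck design, and dimensions are assumptions for illustration and do not reproduce the paper's actual architecture.

import torch
import torch.nn as nn

class TaskAdapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual add."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

class AdaptedBlock(nn.Module):
    """Wraps a frozen backbone block and routes its output through a task-specific adapter."""
    def __init__(self, block: nn.Module, tasks, dim: int):
        super().__init__()
        self.block = block
        for p in self.block.parameters():
            p.requires_grad = False  # backbone stays fixed; only the small adapters are trained
        self.adapters = nn.ModuleDict({t: TaskAdapter(dim) for t in tasks})

    def forward(self, x: torch.Tensor, task: str) -> torch.Tensor:
        return self.adapters[task](self.block(x))

In a scheme like this, each adapter adds only two small linear layers per block, so supporting another task means instantiating one more adapter set rather than retraining the whole transformer, which is the kind of parameter efficiency the abstract refers to.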

Cited by 45 publications (32 citation statements) | References 78 publications

Citation statements (ordered by relevance):
“…However, this limits the performance of dense prediction tasks whose performances are significantly influenced by the resolution of the feature maps generated by the model. Simultaneous to our work, several studies have investigated multi-task transformers [15], [16], [17]. However, their attention maps still possess limited context and lack the capability to model spatial and cross-task interactions globally among features of different tasks.…”
Section: Global Spatial Interaction Simultaneous All-task Interaction
confidence: 99%
“…MTFormer [16] and MQTransformer [17] develop different cross-task information fusion modules, yet the multi-task information cannot interact with each other in a spatially global context. Similar to MTAN [26], the task-specific modules of MulT [15] directly query the task-shared backbone feature to acquire task-specific features but are unable to model cross-task interactions. MultiMAE [34] learns a versatile backbone model but requires fine-tuning on each single task, making it essentially a series of single-task models instead of one multi-task model.…”
Section: Multi-task Learning For Dense Scene Understanding
confidence: 99%
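To make the "task-specific modules directly query the task-shared backbone feature" pattern described above concrete, here is a small, hedged PyTorch sketch of a per-task head whose learned queries cross-attend to shared backbone tokens. The names, shapes, and use of nn.MultiheadAttention are assumptions for illustration; neither MTAN nor MulT is implemented this way verbatim. Note that each head attends only to the shared feature, which is exactly why the quoted passage says this style cannot model cross-task interactions.

import torch
import torch.nn as nn

class TaskQueryHead(nn.Module):
    """Learned task queries cross-attend to task-shared tokens to extract task-specific features."""
    def __init__(self, dim: int, num_queries: int = 196, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(0.02 * torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, shared_tokens: torch.Tensor) -> torch.Tensor:
        # shared_tokens: (B, N, dim) tokens from the task-shared backbone
        q = self.queries.unsqueeze(0).expand(shared_tokens.size(0), -1, -1)
        out, _ = self.attn(q, shared_tokens, shared_tokens)  # queries see only the shared feature
        return self.norm(out)

# One head per task; every head reads the same shared feature map.
heads = nn.ModuleDict({t: TaskQueryHead(dim=384) for t in ("semseg", "depth", "normals")})
shared = torch.randn(2, 196, 384)  # hypothetical ViT tokens (batch of 2, 14x14 grid)
task_features = {t: head(shared) for t, head in heads.items()}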
“…Our DeMT obtains 46.36 SemSeg accuracy, which is 6.3% higher than that of MQTransformer with the same Swin-T backbone and slightly lower FLOPs (100.7G vs. 106.02G). MulT (Bhattacharjee et al. 2022) reports a 13.3% and 8.54% increase in relative performance for semantic…”
Section: Comparison With the State-of-the-art
confidence: 99%
“…[13] proposed a framework that tackles several language tasks but a single vision one. Differently, MulT [5] introduced a multitask transformer to handle multiple vision tasks. More complex vision transformer architectures have demonstrated that they outperform Convolutional Neural Network (CNN) based multitasking methods.…”
Section: Multitask Learning With Vision Transformers
confidence: 99%
“…This, in turn, allows us to achieve dense prediction in the comics domain by leveraging supervision from real-world annotations. Note that DTA differs from the co-attention introduced in prior works [5,6], where, in both cases, the attention is computed based on a specific task. By contrast, we learn a domain transferable attention between different domains.…”
Section: Domain Transferable Attention
confidence: 99%
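The distinction drawn above, attention tied to a specific task versus attention learned between two domains, can be pictured with the small sketch below. This is only one plausible reading, written for illustration: queries come from target-domain tokens (e.g. comics) and keys/values from source-domain tokens (e.g. real-world images), so the attention map relates domains rather than a single task. The class name, shapes, and this exact formulation are assumptions, not the cited paper's actual DTA definition.

import torch
import torch.nn as nn

class CrossDomainAttention(nn.Module):
    """Target-domain tokens attend to source-domain tokens, producing a domain-level attention map."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, target_tokens: torch.Tensor, source_tokens: torch.Tensor) -> torch.Tensor:
        # The attention weights relate the two domains instead of being tied to one task.
        out, _ = self.attn(target_tokens, source_tokens, source_tokens)
        return self.norm(target_tokens + out)

comics = torch.randn(1, 196, 384)  # hypothetical target-domain (comics) tokens
real = torch.randn(1, 196, 384)    # hypothetical source-domain (real-world) tokens
fused = CrossDomainAttention(384)(comics, real)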