2019
DOI: 10.48550/arxiv.1907.07804
Preprint

OmniNet: A unified architecture for multi-modal multi-task learning

Abstract: The Transformer is a widely used neural network architecture, especially for language understanding. We introduce an extended and unified architecture that can be used for tasks involving a variety of modalities, such as images, text, and videos. We propose a spatio-temporal cache mechanism that enables learning the spatial dimensions of the input in addition to the hidden states corresponding to the temporal input sequence. The proposed architecture further enables a single model to support tasks with multiple input m…
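Since the spatio-temporal cache is the abstract's key mechanism, a toy illustration may help. The PyTorch sketch below shows one plausible reading of the idea, not the paper's implementation: spatial grid features and temporal hidden states accumulate in two separate caches that are concatenated into a single memory the decoder attends over. All class names, methods, and dimensions here are invented for illustration.

```python
# Minimal sketch of a spatio-temporal cache (illustrative, not the authors' code).
import torch
import torch.nn as nn

class SpatioTemporalCache:
    """Accumulates spatial and temporal encodings across input modalities."""
    def __init__(self):
        self.spatial = []   # (batch, positions, dim) entries, e.g. image patch grids
        self.temporal = []  # (batch, steps, dim) entries, e.g. encoded token states

    def add_spatial(self, feats):
        self.spatial.append(feats)

    def add_temporal(self, states):
        self.temporal.append(states)

    def as_memory(self):
        # Concatenate all cached entries into one memory for cross-attention.
        return torch.cat(self.spatial + self.temporal, dim=1)

class CacheDecoder(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)

    def forward(self, tgt, cache):
        # The target sequence attends over both spatial and temporal caches.
        return self.decoder(tgt, memory=cache.as_memory())

# Usage: one image's grid features plus one sentence's token states.
cache = SpatioTemporalCache()
cache.add_spatial(torch.randn(1, 49, 256))     # 7x7 CNN feature grid, flattened
cache.add_temporal(torch.randn(1, 12, 256))    # 12 encoded tokens
out = CacheDecoder()(torch.randn(1, 5, 256), cache)  # 5 decoding steps
print(out.shape)  # torch.Size([1, 5, 256])
```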

Cited by 14 publications (19 citation statements)
References 17 publications
“…This setup of task supervision is similar to the cascaded information architectures discussed in section 2.2.3. However, instead of hand-designing a hierarchy of tasks, this method performs a… [Figure 12: OmniNet architecture proposed in (Pramanik et al., 2019). Each modality has a separate network to handle inputs, and the aggregated outputs are processed by an encoder-decoder called the Central Neural Processor.]…”
Section: Multi-modal Architectures (mentioning)
confidence: 99%
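To make the quoted description concrete, here is a minimal hedged sketch of that pattern: per-modality "peripheral" networks project raw inputs into a shared space, and a central encoder-decoder (the role the quote assigns to the Central Neural Processor) consumes the aggregated outputs. The class name, the linear peripherals, and the feature sizes are all illustrative assumptions, not the OmniNet code.

```python
# Sketch of per-modality peripherals feeding a shared encoder-decoder.
import torch
import torch.nn as nn

class CentralProcessorModel(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # One peripheral per modality; real peripherals would be a CNN,
        # a subword embedder, etc. Here: simple linear projections.
        self.peripherals = nn.ModuleDict({
            "image": nn.Linear(2048, dim),  # e.g. pooled CNN features
            "text": nn.Linear(300, dim),    # e.g. pretrained word embeddings
        })
        enc = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        dec = nn.TransformerDecoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=2)
        self.decoder = nn.TransformerDecoder(dec, num_layers=2)

    def forward(self, inputs, tgt):
        # inputs: dict of modality name -> (batch, length, raw_dim) tensors.
        encoded = [self.peripherals[m](x) for m, x in inputs.items()]
        memory = self.encoder(torch.cat(encoded, dim=1))  # aggregate modalities
        return self.decoder(tgt, memory)

model = CentralProcessorModel()
out = model({"image": torch.randn(1, 49, 2048),
             "text": torch.randn(1, 12, 300)},
            tgt=torch.randn(1, 5, 256))
print(out.shape)  # torch.Size([1, 5, 256])
```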
“…Both (Nguyen and Okatani, 2019; Akhtar et al., 2019) focus on a set of tasks that all share the same fixed set of modalities. Instead, (Kaiser et al., 2017) and (Pramanik et al., 2019) focus on building a "universal multi-modal multi-task model", in which a single model can handle multiple tasks with varying input domains. The architecture introduced in (Kaiser et al., 2017) comprises an input encoder, an I/O mixer, and an autoregressive decoder.…”
Section: Multi-modal Architectures (mentioning)
confidence: 99%
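The quote names the three blocks of the (Kaiser et al., 2017) architecture but not how they connect. The sketch below is one hedged reading of that description only, not the original MultiModel: the I/O mixer is approximated as a cross-attention block that lets previously generated outputs attend to the encoded inputs before the decoder produces the next step. All names and sizes are invented.

```python
# Loose three-block sketch: encoder, "I/O mixer", autoregressive decoder.
import torch
import torch.nn as nn

class ThreeBlockModel(nn.Module):
    def __init__(self, dim=256, vocab=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        enc = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=2)
        # "I/O mixer" stand-in: previously generated outputs cross-attend
        # to the encoded inputs.
        mix = nn.TransformerDecoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.mixer = nn.TransformerDecoder(mix, num_layers=1)
        dec = nn.TransformerDecoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec, num_layers=2)
        self.out = nn.Linear(dim, vocab)

    def forward(self, src_ids, prev_out_ids):
        enc = self.encoder(self.embed(src_ids))
        mixed = self.mixer(self.embed(prev_out_ids), memory=enc)
        return self.out(self.decoder(mixed, memory=enc))

model = ThreeBlockModel()
logits = model(torch.randint(0, 1000, (1, 12)),   # encoded input tokens
               torch.randint(0, 1000, (1, 5)))    # previously generated outputs
print(logits.shape)  # torch.Size([1, 5, 1000])
```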
“…However, such methods share a common goal: training a unified model over a group of tasks that performs well while limiting the need for task-specific parameters. Multi-task learning approaches have since been applied to numerous domains, such as forming sentence embeddings [46,51], solving computer vision tasks [26], and even performing multi-modal reasoning [37,39,41]. Several more comprehensive summaries of developments in the multi-task learning space are also available [45,59].…”
Section: Related Work (mentioning)
confidence: 99%
“…The dodecaDialogue task (Shuster et al., 2019) proposes twelve dialogue tasks, among which there are two language/vision tasks in which the agent has to generate a response for a given context. Other works try to exploit multi-task learning to improve on single-task model performance in discriminative tasks (Pramanik et al., 2019; Lu et al., 2019). Unfortunately, implementing multi-task learning using different datasets is cumbersome (Subramanian et al., 2018).…”
Section: Grounded Language Learning Evaluation (mentioning)
confidence: 99%