OmniNet: A unified architecture for multi-modal multi-task learning

Pramanik, Subhojeet; Hussain, Aman

doi:10.48550/arxiv.1907.07804

Cited by 14 publications

(19 citation statements)

References 17 publications


“…This setup of task supervision is similar to the cascaded information architectures discussed in section 2.2.3. However, instead of hand-designing a hierarchy of tasks, this method performs a … [Figure 12: OmniNet architecture proposed in (Pramanik et al, 2019). Each modality has a separate network to handle inputs, and the aggregated outputs are processed by an encoder-decoder called the Central Neural Processor.]…”

Section: Multi-modal Architectures (mentioning)

confidence: 99%
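The layout described in the statement above (one input network per modality, with the aggregated outputs handed to a single shared processor) can be illustrated with a toy sketch. This is not the OmniNet implementation: the names `encode_text`, `encode_image`, `PERIPHERALS`, `central_processor`, and `forward` are hypothetical, the encoders are placeholder arithmetic rather than learned networks, and mean-pooling stands in for the encoder-decoder.

```python
from typing import Callable, Dict, List

Vector = List[float]


def encode_text(tokens: List[str], dim: int = 4) -> List[Vector]:
    """Hypothetical text network: one placeholder feature vector per token."""
    return [[float((len(tok) + i) % 5) for i in range(dim)] for tok in tokens]


def encode_image(pixels: List[List[float]], dim: int = 4) -> List[Vector]:
    """Hypothetical image network: one feature vector per pixel row (row mean)."""
    return [[sum(row) / len(row)] * dim for row in pixels]


# One input network per modality (assumed registry, for illustration only).
PERIPHERALS: Dict[str, Callable[..., List[Vector]]] = {
    "text": encode_text,
    "image": encode_image,
}


def central_processor(features: List[Vector]) -> Vector:
    """Stand-in for the shared encoder-decoder: mean-pools all feature vectors."""
    dim = len(features[0])
    return [sum(vec[i] for vec in features) / len(features) for i in range(dim)]


def forward(inputs: Dict[str, object]) -> Vector:
    """Encode each modality separately, aggregate, then run the shared processor."""
    aggregated: List[Vector] = []
    for modality, raw in inputs.items():
        aggregated.extend(PERIPHERALS[modality](raw))
    return central_processor(aggregated)
```

Calling `forward({"text": ["a", "bc"], "image": [[0.0, 1.0]]})` routes each input through its own encoder before the shared step, so adding a modality in this sketch only means registering another encoder that emits vectors of the same width.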

“…Both (Nguyen and Okatani, 2019; Akhtar et al, 2019) are focused on a set of tasks which all share the same fixed set of modalities. Instead, (Kaiser et al, 2017) and (Pramanik et al, 2019) focus on building a "universal multi-modal multi-task model", in which a single model can handle multiple tasks with varying input domains. The architecture introduced in (Kaiser et al, 2017) is comprised of an input encoder, an I/O mixer, and an autoregressive decoder.…”

Section: Multi-modal Architectures (mentioning)

confidence: 99%

“…The authors also demonstrate that the large degree of sharing between tasks yields significantly increased performance for tasks with limited training data. Instead of aggregating mechanisms from various modes of deep learning, (Pramanik et al, 2019) introduces an architecture called OmniNet with a spatio-temporal cache mechanism to learn dependencies across spatial dimensions of data as well as the temporal dimension. A diagram is shown in figure 12.…”

Section: Multi-modal Architectures (mentioning)

confidence: 99%
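The idea of caching both spatial and temporal views of the input, as mentioned in the statement above, can be illustrated with a small data structure. This is only a sketch under assumptions: the class name `SpatioTemporalCache`, its fields, and the mean-pooling used to derive the temporal entry are hypothetical and are not taken from the OmniNet paper.

```python
from dataclasses import dataclass, field
from typing import List

Vector = List[float]


@dataclass
class SpatioTemporalCache:
    """Toy cache: per-step spatial features plus one pooled vector per time step."""

    spatial: List[List[Vector]] = field(default_factory=list)
    temporal: List[Vector] = field(default_factory=list)

    def add_step(self, regions: List[Vector]) -> None:
        """Store one time step given its per-region feature vectors."""
        # Spatial view: keep every region vector for this step.
        self.spatial.append(regions)
        # Temporal view (assumed): summarise the step by mean-pooling its regions.
        dim = len(regions[0])
        pooled = [sum(r[i] for r in regions) / len(regions) for i in range(dim)]
        self.temporal.append(pooled)
```

A later processing stage could then look up `cache.spatial` for dependencies within a step and `cache.temporal` for dependencies across steps.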


Multi-Task Learning with Deep Neural Networks: A Survey

Crawshaw (2020). Preprint.


Multi-task learning (MTL) is a subfield of machine learning in which multiple tasks are simultaneously learned by a shared model. Such approaches offer advantages like improved data efficiency, reduced overfitting through shared representations, and fast learning by leveraging auxiliary information. However, the simultaneous learning of multiple tasks presents new design and optimization challenges, and choosing which tasks should be learned jointly is in itself a non-trivial problem. In this survey, we give an overview of multi-task learning methods for deep neural networks, with the aim of summarizing both the well-established and most recent directions within the field. Our discussion is structured according to a partition of the existing deep MTL techniques into three groups: architectures, optimization methods, and task relationship learning. We also provide a summary of common multi-task benchmarks.


“…However, such methods share a common goal of training a unified model over a group of tasks that performs well and limits requirements for task-specific parameters. Multi-task learning approaches have since been applied to numerous domains, such as forming sentence embeddings [46,51], solving computer vision tasks [26], and even performing multi-modal reasoning [37,39,41]. Several, more comprehensive, summaries of developments in the multi-task learning space are also available [45,59].…”

Section: Related Work (mentioning)

confidence: 99%

Exceeding the Limits of Visual-Linguistic Multi-Task Learning

Wolfe, Lundgaard (2021). Preprint.


By leveraging large amounts of product data collected across hundreds of live e-commerce websites, we construct 1000 unique classification tasks that share similarly-structured input data, comprised of both text and images. These classification tasks focus on learning the product hierarchy of different e-commerce websites, causing many of them to be correlated. Adopting a multi-modal transformer model, we solve these tasks in unison using multi-task learning (MTL). Extensive experiments are presented over an initial 100-task dataset to reveal best practices for "large-scale MTL" (i.e., MTL with ≥ 100 tasks). From these experiments, a final, unified methodology is derived, which is composed of both best practices and new proposals such as DyPa, a simple heuristic for automatically allocating task-specific parameters to tasks that could benefit from extra capacity. Using our large-scale MTL methodology, we successfully train a single model across all 1000 tasks in our dataset while using minimal task-specific parameters, thereby showing that it is possible to extend several orders of magnitude beyond current efforts in MTL.

CCS Concepts: Computing methodologies → Machine learning algorithms; Neural networks; Computer vision; Natural language processing.


“…The dodecaDialogue task (Shuster et al, 2019) proposes twelve dialogue tasks, among which there are two language/vision tasks in which the agent has to generate a response for a given context. Other works try to exploit multi-task learning to improve on single-task model performance in discriminative tasks (Pramanik et al, 2019; Lu et al, 2019). Unfortunately, implementing multi-task learning using different datasets is cumbersome (Subramanian et al, 2018).…”

Section: Grounded Language Learning Evaluation (mentioning)

confidence: 99%

CompGuessWhat?!: A Multi-task Evaluation Framework for Grounded Language Learning

Suglia, Konstas, Vanzo et al. (2020). Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.


Approaches to Grounded Language Learning typically focus on a single task-based final performance measure that may not depend on desirable properties of the learned hidden representations, such as their ability to predict salient attributes or to generalise to unseen situations. To remedy this, we present GROLLA, an evaluation framework for Grounded Language Learning with Attributes with three subtasks: 1) Goal-oriented evaluation; 2) Object attribute prediction evaluation; and 3) Zero-shot evaluation. We also propose a new dataset CompGuessWhat?! as an instance of this framework for evaluating the quality of learned neural representations, in particular concerning attribute grounding. To this end, we extend the original GuessWhat?! dataset by including a semantic layer on top of the perceptual one. Specifically, we enrich the VisualGenome scene graphs associated with the GuessWhat?! images with abstract and situated attributes. By using diagnostic classifiers, we show that current models learn representations that are not expressive enough to encode object attributes (average F1 of 44.27). In addition, they do not learn strategies nor representations that are robust enough to perform well when novel scenes or objects are involved in gameplay (zero-shot best accuracy 50.06%).
