ExT5: Towards Extreme Multi-Task Scaling for Transfer Learning

Aribandi, Vamsi; Tay, Yi; Schuster, Tal; Rao, Jinfeng; Zheng, Huaixiu; Mehta, Sanket Vaibhav; Zhuang, Honglei; Trần, Vinh Cao; Bahri, Dara; Ni, Jianmo; Gupta, J.P.; Huang, Kai; Ruder, Sebastian; Metzler, Donald

doi:10.48550/arxiv.2111.10952

Cited by 20 publications

(27 citation statements)

References 36 publications


“…Finally, this work is also closely related to the theme of model unification where unified multi-task models have been recently popular due to immense potential (Raffel et al, 2019; Khashabi et al, 2020; Aribandi et al, 2021). Hence, the proposed DSI presents an opportunity to integrate discrete and disjoint search operations into end-to-end unified models - a unique capability that was not possible before.…”

Section: Related Work (mentioning)

confidence: 96%

Transformer Memory as a Differentiable Search Index

Tay, Trần, Dehghani et al. 2022

Preprint

Self Cite


In this paper, we demonstrate that information retrieval can be accomplished with a single Transformer, in which all information about the corpus is encoded in the parameters of the model. To this end, we introduce the Differentiable Search Index (DSI), a new paradigm that learns a text-to-text model that maps string queries directly to relevant docids; in other words, a DSI model answers queries directly using only its parameters, dramatically simplifying the whole retrieval process. We study variations in how documents and their identifiers are represented, variations in training procedures, and the interplay between models and corpus sizes. Experiments demonstrate that given appropriate design choices, DSI significantly outperforms strong baselines such as dual encoder models. Moreover, DSI demonstrates strong generalization capabilities, outperforming a BM25 baseline in a zero-shot setup.


“…How to aggregate performances? The multi-tasks setting has been investigated in recent works that provide benchmark of state-of-the-art models across a great variety of tasks (Rajpurkar et al, 2016; McCann et al, 2018; Conneau et al, 2018a; Zheng et al, 2021; Tay et al, 2020b), sometimes with more than fifty (Siddhant et al, 2020; Aribandi et al, 2021; Wei et al, 2021; Sanh et al, 2021). These papers provide tables of scores across the considered tasks, but the only non-qualitative way to compare systems consists in averaging the performances across tasks and then ranking systems according to their mean score values.…”

Section: Work In Progress (mentioning)

confidence: 99%

What are the best systems? New perspectives on NLP Benchmarking

Colombo, Noiry, Irurozki et al. 2022

Preprint


In Machine Learning, a benchmark refers to an ensemble of datasets associated with one or multiple metrics together with a way to aggregate different systems performances. They are instrumental in (i) assessing the progress of new methods along different axes and (ii) selecting the best systems for practical use. This is particularly the case for NLP with the development of large pre-trained models (e.g. GPT, BERT) that are expected to generalize well on a variety of tasks. While the community mainly focused on developing new datasets and metrics, there has been little interest in the aggregation procedure, which is often reduced to a simple average over various performance measures. However, this procedure can be problematic when the metrics are on a different scale, which may lead to spurious conclusions. This paper proposes a new procedure to rank systems based on their performance across different tasks. Motivated by the social choice theory, the final system ordering is obtained through aggregating the rankings induced by each task and is theoretically grounded. We conduct extensive numerical experiments (on over 270k scores) to assess the soundness of our approach both on synthetic and real scores (e.g. GLUE, EXTREM, SE-VAL, TAC, FLICKR). In particular, we show that our method yields different conclusions on state-of-the-art systems than the mean-aggregation procedure while being both more reliable and robust.
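The contrast the abstract draws between mean aggregation and rank aggregation can be illustrated with a small sketch. The abstract does not say which social-choice rule the paper uses, so the Borda count below is an assumption chosen for simplicity, and the system names and scores are invented toy data, not results from any paper.

```python
from typing import Dict, List

# Toy scores (hypothetical): three systems, four tasks, higher is better.
# The first three tasks are on a 0-1 scale, the fourth on a 0-100 scale.
scores: Dict[str, List[float]] = {
    "A": [0.90, 0.80, 0.70, 10.0],
    "B": [0.85, 0.75, 0.65, 40.0],
    "C": [0.80, 0.70, 0.60, 20.0],
}


def mean_ranking(scores: Dict[str, List[float]]) -> List[str]:
    """Rank systems by their plain average score across tasks."""
    return sorted(scores, key=lambda s: -sum(scores[s]) / len(scores[s]))


def borda_ranking(scores: Dict[str, List[float]]) -> List[str]:
    """Rank systems by Borda count over the per-task rankings.

    On each task the worst system gets 0 points, the next gets 1, and so on.
    Only the order within a task matters, so the task's scale is irrelevant.
    """
    systems = list(scores)
    n_tasks = len(next(iter(scores.values())))
    points = {s: 0 for s in systems}
    for t in range(n_tasks):
        worst_to_best = sorted(systems, key=lambda s: scores[s][t])
        for pts, s in enumerate(worst_to_best):
            points[s] += pts
    return sorted(systems, key=lambda s: -points[s])


print("mean :", mean_ranking(scores))
print("borda:", borda_ranking(scores))
```

In this toy data the single large-scale task dominates the mean, so the two procedures disagree on which system comes first, even though system A wins three of the four tasks.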

“…In addition, in the wake of the recent surge of interest in massively multitask few-shot NLP models (Min et al, 2021; Wei et al, 2021; Aribandi et al, 2021; Sanh et al, 2022; Karimi Mahabadi et al, 2021, inter alia), we also evaluate our latent-skill model on CrossFit (Ye et al, 2021). This benchmark recasts 160 NLP tasks (including QA, conditional text generation, classification, and other types such as regression) as text-to-text generation problems.…”

Section: Fine-grained Skill Selection (mentioning)

confidence: 99%

“…Multitask NLP. Multitask learning for NLP has been an effective strategy for improving model performance in low-resource tasks and for quickly adapting to new, unseen tasks (Ruder et al, 2019; Liu et al, 2019; Min et al, 2021; Wei et al, 2021; Aribandi et al, 2021; Sanh et al, 2022; Karimi Mahabadi et al, 2021; Rusu et al, 2019), languages (Ponti et al, 2019), and modalities (Bugliarello et al, 2022). Liu et al (2019) adopt a multitask training strategy with a shared model and achieve impressive performance on GLUE.…”

Section: Related Work (mentioning)

confidence: 99%

Combining Modular Skills in Multitask Learning

Ponti, Sordoni, Bengio et al. 2022

Preprint


A modular design encourages neural models to disentangle and recombine different facets of knowledge to generalise more systematically to new tasks. In this work, we assume that each task is associated with a subset of latent discrete skills from a (potentially small) inventory. In turn, skills correspond to parameter-efficient (sparse / low-rank) model parameterisations. By jointly learning these and a task-skill allocation matrix, the network for each task is instantiated as the average of the parameters of active skills. To favour non-trivial soft partitions of skills across tasks, we experiment with a series of inductive biases, such as an Indian Buffet Process prior and a two-speed learning rate. We evaluate our latent-skill model on two main settings: 1) multitask reinforcement learning for grounded instruction following on 8 levels of the BabyAI platform; and 2) few-shot adaptation of pre-trained text-to-text generative models on CrossFit, a benchmark comprising 160 NLP tasks. We find that the modular design of a network significantly increases sample-efficiency in reinforcement learning and few-shot generalisation in supervised learning, compared to baselines with fully shared, task-specific, or conditionally generated parameters where knowledge is entangled across tasks. In addition, we show how discrete skills help interpretability, as they yield an explicit hierarchy of tasks.
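The phrase "the network for each task is instantiated as the average of the parameters of active skills" can be made concrete with a minimal sketch. It assumes a fixed binary task-skill allocation matrix and tiny hand-written parameter vectors; in the paper the allocation is learned jointly with the skill parameters, and the function name `task_parameters` and all values below are hypothetical.

```python
from typing import List


def task_parameters(
    allocation: List[List[int]],
    skill_params: List[List[float]],
) -> List[List[float]]:
    """Average the parameter vectors of the skills active for each task.

    allocation[i][j] is 1 if task i uses skill j, else 0 (assumed fixed here).
    skill_params[j] is the parameter vector of skill j.
    """
    result = []
    for row in allocation:
        active = [p for z, p in zip(row, skill_params) if z]
        if not active:
            raise ValueError("each task needs at least one active skill")
        dim = len(active[0])
        result.append([sum(p[d] for p in active) / len(active) for d in range(dim)])
    return result


# Three skills with 2-dimensional parameters (toy values).
skill_params = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]]

# Three tasks: task 0 uses skills 0 and 1, task 1 uses skill 2, task 2 uses all.
allocation = [[1, 1, 0], [0, 0, 1], [1, 1, 1]]

task_params = task_parameters(allocation, skill_params)
print(task_params)
```

Tasks that share active skills end up with overlapping parameters, which is the mechanism the abstract credits for recombining knowledge across tasks.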
