2021
DOI: 10.48550/arxiv.2109.14076
Preprint

RAFT: A Real-World Few-Shot Text Classification Benchmark

Abstract: Large pre-trained language models have shown promise for few-shot learning, completing text-based tasks given only a few task-specific examples. Will models soon solve classification tasks that have so far been reserved for human research assistants? Existing benchmarks are not designed to measure progress in applied settings, and so don't directly answer this question. The RAFT benchmark (Real-world Annotated Few-shot Tasks) focuses on naturally occurring tasks and uses an evaluation setup that mirrors deployment.
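As context for the evaluation setup the abstract describes, here is a minimal sketch of prompt-based few-shot classification: the prompt contains the handful of labeled examples a RAFT task provides, and each candidate label is scored by its log-likelihood under the language model. The tiny GPT-2 backbone, the two-example training set, and the `classify` helper are illustrative assumptions, not the RAFT baselines' actual implementation.

```python
# Sketch: few-shot in-context classification by scoring each label's
# verbalization under a causal language model. Assumed setup, not RAFT's code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Hypothetical few-shot training set: (text, label) pairs.
train = [
    ("Card payment declined twice this morning.", "declined_card_payment"),
    ("Why was I charged a fee for the transfer?", "transfer_fee_charged"),
]
labels = sorted({y for _, y in train})

def classify(text: str) -> str:
    """Pick the label whose tokens are most likely given the few-shot prompt."""
    prompt = "".join(f"Text: {x}\nLabel: {y}\n\n" for x, y in train)
    prompt += f"Text: {text}\nLabel:"
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    scores = {}
    for label in labels:
        label_ids = tokenizer(" " + label, return_tensors="pt").input_ids
        input_ids = torch.cat([prompt_ids, label_ids], dim=1)
        with torch.no_grad():
            logits = model(input_ids).logits
        # Shifted log-probabilities: position t predicts token t+1.
        log_probs = logits[:, :-1].log_softmax(-1)
        target = input_ids[:, 1:]
        picked = log_probs.gather(2, target.unsqueeze(-1)).squeeze(-1)
        # Sum only over the label tokens appended after the prompt.
        scores[label] = picked[0, prompt_ids.shape[1] - 1:].sum().item()
    return max(scores, key=scores.get)

print(classify("I was billed extra when sending money abroad."))
```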

Cited by 7 publications (13 citation statements)
References 21 publications
“…Our formulation of this task allows us to directly compare general purpose language models to special purpose systems on the same axis in order to assess a more realistic capability. Finally, we note that we can extend the analysis we do here to other economically valuable real-world tasks such as those in the recent Real-World Few-Shot Text-Classification (RAFT) benchmark [1].…”
Section: A3 Recommendation System Experiments (mentioning)
confidence: 90%
“…Finally, we demonstrate the benefits of pre-training the (IA)³ parameters before fine-tuning [18,19]. Our overall recipe, which we dub "T-Few", attains significantly stronger performance than ICL (even against 16× larger models) and outperforms humans for the first time on the real-world few-shot learning benchmark RAFT [2] while requiring dramatically less compute and allowing for mixed-task batches during inference. To facilitate the use of T-Few on new problems as well as future research on PEFT, we release our code.…”
Section: Introduction (mentioning)
confidence: 90%
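For readers unfamiliar with the method named in this quote: (IA)³ fine-tunes a frozen model by learning vectors that elementwise rescale the attention keys and values and the feed-forward hidden activations, so only those vectors receive gradient updates. The sketch below is a simplified illustration of that idea in PyTorch; the module layout and shapes are assumptions, not T-Few's actual code.

```python
# Sketch of (IA)^3: learned rescaling vectors (initialized to ones) on keys,
# values, and FFN hidden activations; all pre-trained weights stay frozen.
import torch
import torch.nn as nn

class IA3Attention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.l_k = nn.Parameter(torch.ones(d_model))  # (IA)^3 key rescaling
        self.l_v = nn.Parameter(torch.ones(d_model))  # (IA)^3 value rescaling

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = self.q(x)
        k = self.k(x) * self.l_k   # elementwise rescaling of keys
        v = self.v(x) * self.l_v   # elementwise rescaling of values
        attn = torch.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
        return attn @ v

class IA3FFN(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)
        self.w2 = nn.Linear(d_ff, d_model)
        self.l_ff = nn.Parameter(torch.ones(d_ff))  # (IA)^3 FFN rescaling

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(torch.relu(self.w1(x)) * self.l_ff)

# Train only the (IA)^3 vectors; everything else stays frozen.
blocks = nn.ModuleList([IA3Attention(64), IA3FFN(64, 256)])
for name, p in blocks.named_parameters():
    p.requires_grad = name.split(".")[-1].startswith("l_")
```

The ones-initialization is what makes this safe: before any training step, the rescaled model is exactly the frozen pre-trained model.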
“…Evaluation of generalization capabilities can then be straightforwardly done by measuring performance on these held-out datasets. We also will later test T-Few's abilities in the RAFT benchmark [2] in section 4.3, a collection of unseen "real-world" few-shot tasks with no validation set and a held-out test set.…”
Section: Model and Datasets (mentioning)
confidence: 99%
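As a concrete companion to this quote, here is what fetching a RAFT task looks like in practice, assuming the publicly released Hugging Face dataset ID `ought/raft` and the `banking_77` task name (neither is stated on this page):

```python
# Each RAFT task ships a 50-example labeled train split; test-split labels
# are held out, so predictions are scored via the leaderboard.
from datasets import load_dataset

task = load_dataset("ought/raft", "banking_77")
print(task["train"].num_rows)  # 50 labeled examples
print(task["test"].num_rows)   # unlabeled examples to predict
```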