2021
DOI: 10.48550/arxiv.2109.14076
Preprint

RAFT: A Real-World Few-Shot Text Classification Benchmark

Abstract: Large pre-trained language models have shown promise for few-shot learning, completing text-based tasks given only a few task-specific examples. Will models soon solve classification tasks that have so far been reserved for human research assistants? Existing benchmarks are not designed to measure progress in applied settings, and so don't directly answer this question. The RAFT benchmark (Real-world Annotated Few-shot Tasks) focuses on naturally occurring tasks and uses an evaluation setup that mirrors deployment.
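As context for the evaluation setup the abstract describes, here is a minimal sketch of prompt-based few-shot classification: the prompt contains the handful of labeled examples a RAFT task provides, and each candidate label is scored by its log-likelihood under the language model. The tiny GPT-2 backbone, the two-example training set, and the `classify` helper are illustrative assumptions, not the RAFT baselines' actual implementation.

```python
# Sketch: few-shot in-context classification by scoring each label's
# verbalization under a causal language model. Assumed setup, not RAFT's code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Hypothetical few-shot training set: (text, label) pairs.
train = [
    ("Card payment declined twice this morning.", "declined_card_payment"),
    ("Why was I charged a fee for the transfer?", "transfer_fee_charged"),
]
labels = sorted({y for _, y in train})

def classify(text: str) -> str:
    """Pick the label whose tokens are most likely given the few-shot prompt."""
    prompt = "".join(f"Text: {x}\nLabel: {y}\n\n" for x, y in train)
    prompt += f"Text: {text}\nLabel:"
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    scores = {}
    for label in labels:
        label_ids = tokenizer(" " + label, return_tensors="pt").input_ids
        input_ids = torch.cat([prompt_ids, label_ids], dim=1)
        with torch.no_grad():
            logits = model(input_ids).logits
        # Shifted log-probabilities: position t predicts token t+1.
        log_probs = logits[:, :-1].log_softmax(-1)
        target = input_ids[:, 1:]
        picked = log_probs.gather(2, target.unsqueeze(-1)).squeeze(-1)
        # Sum only over the label tokens appended after the prompt.
        scores[label] = picked[0, prompt_ids.shape[1] - 1:].sum().item()
    return max(scores, key=scores.get)

print(classify("I was billed extra when sending money abroad."))
```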

Cited by 7 publications (13 citation statements)
References 21 publications
“…Our formulation of this task allows us to directly compare general purpose language models to special purpose systems on the same axis in order to assess a more realistic capability. Finally, we note that we can extend the analysis we do here to other economically valuable real-world tasks such as those in the recent Real-World Few-Shot Text-Classification (RAFT) benchmark [1].…”
Section: A3 Recommendation System Experiments (mentioning)
confidence: 90%
“…Finally, we demonstrate the benefits of pre-training the (IA)³ parameters before fine-tuning [18,19]. Our overall recipe, which we dub "T-Few", attains significantly stronger performance than ICL (even against 16× larger models) and outperforms humans for the first time on the real-world few-shot learning benchmark RAFT [2] while requiring dramatically less compute and allowing for mixed-task batches during inference. To facilitate the use of T-Few on new problems as well as future research on PEFT, we release our code.…”
Section: Introduction (mentioning)
confidence: 90%
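For readers unfamiliar with the method named in this quote: (IA)³ fine-tunes a frozen model by learning vectors that elementwise rescale the attention keys and values and the feed-forward hidden activations, so only those vectors receive gradient updates. The sketch below is a simplified illustration of that idea in PyTorch; the module layout and shapes are assumptions, not T-Few's actual code.

```python
# Sketch of (IA)^3: learned rescaling vectors (initialized to ones) on keys,
# values, and FFN hidden activations; all pre-trained weights stay frozen.
import torch
import torch.nn as nn

class IA3Attention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.l_k = nn.Parameter(torch.ones(d_model))  # (IA)^3 key rescaling
        self.l_v = nn.Parameter(torch.ones(d_model))  # (IA)^3 value rescaling

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = self.q(x)
        k = self.k(x) * self.l_k   # elementwise rescaling of keys
        v = self.v(x) * self.l_v   # elementwise rescaling of values
        attn = torch.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
        return attn @ v

class IA3FFN(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)
        self.w2 = nn.Linear(d_ff, d_model)
        self.l_ff = nn.Parameter(torch.ones(d_ff))  # (IA)^3 FFN rescaling

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(torch.relu(self.w1(x)) * self.l_ff)

# Train only the (IA)^3 vectors; everything else stays frozen.
blocks = nn.ModuleList([IA3Attention(64), IA3FFN(64, 256)])
for name, p in blocks.named_parameters():
    p.requires_grad = name.split(".")[-1].startswith("l_")
```

The ones-initialization is what makes this safe: before any training step, the rescaled model is exactly the frozen pre-trained model.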
“…Evaluation of generalization capabilities can then be straightforwardly done by measuring performance on these held-out datasets. We also will later test T-Few's abilities in the RAFT benchmark [2] in section 4.3, a collection of unseen "real-world" few-shot tasks with no validation set and a held-out test set.…”
Section: Model and Datasets (mentioning)
confidence: 99%
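As a concrete companion to this quote, here is what fetching a RAFT task looks like in practice, assuming the publicly released Hugging Face dataset ID `ought/raft` and the `banking_77` task name (neither is stated on this page):

```python
# Each RAFT task ships a 50-example labeled train split; test-split labels
# are held out, so predictions are scored via the leaderboard.
from datasets import load_dataset

task = load_dataset("ought/raft", "banking_77")
print(task["train"].num_rows)  # 50 labeled examples
print(task["test"].num_rows)   # unlabeled examples to predict
```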