2022
DOI: 10.48550/arxiv.2207.05221
Preprint

Language Models (Mostly) Know What They Know

Abstract: We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format. Thus we can approach self-evaluation on open-ended sampling tasks by asking models to first propose answers, and then to evaluate the probability "P(True)" that their answers are correct. We find encouraging performance, …
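As a rough illustration of the self-evaluation setup the abstract describes, the sketch below asks a model for an answer and then for the probability that its own answer is true. It is a minimal sketch, not the paper's code: `query_model` and `true_token_probability` are hypothetical stand-ins for whatever LLM API is at hand, and the prompt wording is only an approximation of the format used in the paper.

```python
# Minimal sketch of P(True) self-evaluation, assuming two hypothetical helpers
# that wrap an LLM API (they are NOT from the paper or any specific library).

def query_model(prompt: str) -> str:
    """Hypothetical helper: return one sampled completion for `prompt`."""
    raise NotImplementedError

def true_token_probability(prompt: str) -> float:
    """Hypothetical helper: the probability mass the model assigns to the
    token " True" as the next token after `prompt`."""
    raise NotImplementedError

def p_true(question: str) -> tuple[str, float]:
    # Step 1: have the model propose an answer to the open-ended question.
    proposed = query_model(f"Question: {question}\nAnswer:")

    # Step 2: ask the model to judge its own proposal; P(True) is the
    # probability it assigns to the "True" option.
    judge_prompt = (
        f"Question: {question}\n"
        f"Proposed Answer: {proposed}\n"
        "Is the proposed answer:\n"
        " (A) True\n"
        " (B) False\n"
        "The proposed answer is:"
    )
    return proposed, true_token_probability(judge_prompt)
```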

Cited by 57 publications (66 citation statements)
References 4 publications
“…On QuALITY, unaided humans do poorly, with human-model teams doing substantially better. These results are consistent with Kadavath et al. (2022)'s report that our plain pretrained language models tend to be very well calibrated on multiple-choice question answering, but that their calibration degrades after RLHF training for helpfulness, which we use for all runs in this paper. We expect that explicit calibration training on the task would have improved these results for human participants and human-model teams (Lichtenstein and Fischhoff, 1980), which could potentially improve raw accuracy as well by better weighting votes across participants.…”
Section: Results (supporting)
confidence: 91%
“…Calibrated probability estimates match the true empirical frequencies of an outcome, and calibration is often used to evaluate the quality of uncertainty estimates provided by ML models. Recent works have observed that highly-accurate models that leverage pre-training are often well-calibrated [21,22,23]. However, we find that even pre-trained models are poorly calibrated when they are fine-tuned using DP-SGD.…”
Section: Related Work (mentioning)
confidence: 65%
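The calibration notion quoted above (predicted probabilities matching empirical frequencies) is commonly measured with expected calibration error. Below is a small, self-contained sketch of that metric using NumPy; the binning scheme and bin count are illustrative choices, not something specified by the cited works.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Bin predictions by confidence, compare mean confidence with empirical
    accuracy in each bin, and return the bin-weighted absolute gap (ECE)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(bin_edges[:-1], bin_edges[1:])):
        # First bin is closed on the left so confidences of exactly 0 are counted.
        in_bin = (confidences >= lo if i == 0 else confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(confidences[in_bin].mean() - correct[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# Toy usage: confidences could be a model's P(True) scores, `correct` the
# 0/1 ground-truth labels for whether each answer was actually right.
print(expected_calibration_error([0.9, 0.8, 0.6, 0.3], [1, 1, 0, 0]))
```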
“…In this work, we aim to apply this psychological self-improvement strategy for human behavior to the behavior of LLMs. Second, the emerging abilities of LLMs to perform self-validation and self-correction, as demonstrated in recent studies [30][31][32] , suggest the possibility of addressing this challenging problem using ChatGPT itself. Third, we draw inspiration from existing Jailbreaks, many of which bypass ChatGPT's moral alignment by guiding it into certain uncontrollable "modes" that will then generate harmful responses.…”
Section: Toxic (mentioning)
confidence: 99%
“…Recent studies have been exploring the capacity of large language models to validate and correct their own claims [30][31][32] . For instance, the prior work 31 investigates the ability of language models to evaluate the validity of their claims and predict their ability to answer questions, while the recent study 30 demonstrates the capacity of LLMs for moral correction.…”
Section: Related Work (mentioning)
confidence: 99%