2021
DOI: 10.48550/arxiv.2107.00061
Preprint

All That's 'Human' Is Not Gold: Evaluating Human Evaluation of Generated Text

Abstract: Human evaluations are typically considered the gold standard in natural language generation, but as models' fluency improves, how well can evaluators detect and judge machine-generated text? We run a study assessing non-experts' ability to distinguish between human- and machine-authored text (GPT-2 and GPT-3) in three domains (stories, news articles, and recipes). We find that, without training, evaluators distinguished between GPT-3- and human-authored text at random chance level. We explore three approaches for quick…

Cited by 25 publications (31 citation statements)
References 16 publications

“…6), using a linear regression model predicting trial-by-trial judgments as a function of categorical variables encoding sentence length (short, medium, long) and the source of the sentence (Wikipedia, GSN, MH, LSTM, or n-gram). First, we find that the naturalness of sentences from GSN declines by 14 points at longer sentence lengths, p < 0.001, while the naturalness of Wikipedia sentences is unaffected by length (interaction term, p < 0.001), consistent with results reported by Ippolito et al (2020). [Footnote: See Clark et al (2021) for a discussion of the merits of phrasing the question in terms of naturalness instead of asking participants to judge whether it was produced by a human or machine.]…”
Section: Behavioral Results
Citation type: mentioning
confidence: 92%
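The quoted passage describes a linear regression over per-trial naturalness judgments with categorical predictors for sentence length and sentence source, plus their interaction. The sketch below shows what such a model might look like in statsmodels; the column names, toy data, and model call are assumptions for illustration, not the cited authors' actual data or code.

```python
# A minimal sketch of a trial-by-trial regression like the one described above.
# The column names and toy values are assumptions for illustration only.
import pandas as pd
import statsmodels.formula.api as smf

# Each row is one trial: one participant's naturalness rating for one sentence.
trials = pd.DataFrame({
    "naturalness": [82, 79, 70, 68, 52, 55,    # GSN sentences
                    85, 83, 84, 80, 86, 82,    # Wikipedia sentences
                    60, 58, 48, 50, 35, 38],   # LSTM sentences
    "length": ["short", "short", "medium", "medium", "long", "long"] * 3,
    "source": ["GSN"] * 6 + ["Wikipedia"] * 6 + ["LSTM"] * 6,
})

# Categorical predictors for sentence length and source plus their interaction,
# mirroring the length-by-source effect the quoted analysis reports.
model = smf.ols("naturalness ~ C(length) * C(source)", data=trials).fit()
print(model.summary())
```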
“…Recent years have brought considerable improvements in the language generation capabilities of neural language models (Lewis et al 2020; Ji et al 2020; Brown et al 2020; Holtzman et al 2020), allowing users of these systems to pass off their generations as human-produced (Ippolito et al 2020). These advances have raised dual-use concerns as to whether these tools could be used to generate text for malicious purposes (Radford et al 2019; Bommasani et al 2021), which humans would struggle to detect (Clark et al 2021).…”
Section: Synthetic Disinformation Generation
Citation type: mentioning
confidence: 99%
“…One of the latest tools, the RoFT (Real or Fake Text) tool (Dugan et al, 2020), is used to evaluate human detection, showing that text generation models are capable of fooling humans by one or two sentences. Recent research (Clark et al, 2021) shows that training humans on the evaluation task for GPT-3-generated text only improves the overall accuracy up to 50%. Despite the interest in measuring the ability of humans to detect automatically generated text, not much research has been conducted to develop automatic tools to distinguish machine-generated text from human-written text.…”
Section: Human Detection Of Machine Generated Text
Citation type: mentioning
confidence: 99%