Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Confer 2021
DOI: 10.18653/v1/2021.acl-long.565

All That’s ‘Human’ Is Not Gold: Evaluating Human Evaluation of Generated Text

Abstract: Human evaluations are typically considered the gold standard in natural language generation, but as models' fluency improves, how well can evaluators detect and judge machine-generated text? We run a study assessing non-experts' ability to distinguish between human- and machine-authored text (GPT2 and GPT3) in three domains (stories, news articles, and recipes). We find that, without training, evaluators distinguished between GPT3- and human-authored text at random chance level. We explore three approaches for quick…

Cited by 140 publications
(142 citation statements)
References 28 publications
“…Large-scale deep neural network models have an extraordinary capacity to generate linguistic continuations of natural language prompts (5,8). The models provide the probability of words given a context captured by preceded sentences that is similar to human predictions (14).…”
Section: Discussion (mentioning)
confidence: 99%
“…Creative texts, such as stories, are less constrained than translated texts, but researchers continue to employ crowd workers to evaluate creative texts, often without evaluating reference texts (see Section 2). Previous studies have asked workers to choose from (Mori et al, 2019) or distinguish between human-written and machine-generated texts (Garbacea et al, 2019; Ippolito et al, 2020; Clark et al, 2021).…”
Section: Related Work (mentioning)
confidence: 99%
“…All those studies focus on asking (crowdsourced) human annotators to decide if a text was generated by a machine or a human. Clark et al (2021) points out that the high fluency of modern generation models, combined with a generally low expectation of what machines can accomplish, makes it hard to make this distinguished, even for lightly trained annotators.…”
Section: Related Work (mentioning)
confidence: 99%