Human evaluations are typically considered the gold standard in natural language generation, but as models' fluency improves, how well can evaluators detect and judge machine-generated text? We run a study assessing non-experts' ability to distinguish between human- and machine-authored text (GPT2 and GPT3) in three domains (stories, news articles, and recipes). We find that, without training, evaluators distinguished between GPT3- and human-authored text at random chance level. We explore three approaches for quickly training evaluators to better identify GPT3-authored text (detailed instructions, annotated examples, and paired examples) and find that while evaluators' accuracy improved up to 55%, it did not significantly improve across the three domains. Given the inconsistent results across text domains and the often contradictory reasons evaluators gave for their judgments, we examine the role untrained human evaluations play in NLG evaluation and provide recommendations to NLG researchers for improving human evaluations of text generated from state-of-the-art models.
Productive high-titer infection by human immunodeficiency virus type 1 (HIV-1) requires the activation of target cells. Infection of quiescent peripheral CD4 lymphocytes by HIV-1 results in incomplete, labile reverse transcripts and lack of viral progeny formation. An interplay between Tat and p53 has previously been reported, where Tat inhibited the transcription of the p53 gene, which may aid in the development of AIDS-related malignancies, and p53 expression inhibited HIV-1 long terminal repeat transcription. Here, by using a well-defined and -characterized stress signal, gamma irradiation, we find that upon gamma irradiation, HIV-1-infected cells lose their G1/S checkpoints, enter the S phase inappropriately, and eventually apoptose. The loss of the G1/S checkpoint is associated with a loss of p21/Waf1 protein and increased activity of a major G1/S kinase, namely, cyclin E/cdk2. The p21/Waf1 protein, a known cyclin-dependent kinase inhibitor, interacts with the cdk2/cyclin E complex and inhibits progression of cells into S phase. We find that loss of the G1/S checkpoint in HIV-1-infected cells may in part be due to Tat's ability to bind p53 (a known activator of the p21/Waf1 promoter) and sequester its transactivation activity, as seen in both in vivo and in vitro transcription assays. The loss of p21/Waf1 in HIV-1-infected cells was specific to p21/Waf1 and did not occur with other KIP family members, such as p27 (KIP1) and p57 (KIP2). Finally, the advantage of a loss of the G1/S checkpoint for HIV-1 per se may be that it pushes the host cell into the S phase, which may then allow subsequent virus-associated processes, such as RNA splicing, transport, translation, and packaging of virion-specific genes, to occur.
History, Hayden White remarks, has no distinctively historical method, but borrows its models and methods from a variety of other disciplines. These disciplines, however, have varied over time. Late nineteenth-century German historiography looked to the rigorous procedures of the natural sciences to reconstruct the past "as it actually happened"; mid-twentieth-century historians turned to the social sciences, especially to anthropology and sociology, for their models and methods. More recently, historians' appropriation of (and experimentation with) concepts derived from literary and critical theory has occasioned much heated discussion within the field.
Evaluation practices in natural language generation (NLG) have many known flaws, but improved evaluation approaches are rarely widely adopted. This issue has become more urgent, since neural NLG models have improved to the point where their outputs can often no longer be distinguished based on the surface-level features that older metrics rely on. This paper surveys the issues with human and automatic model evaluations and with commonly used datasets in NLG that have been pointed out over the past 20 years. We summarize, categorize, and discuss how researchers have been addressing these issues and what their findings mean for the current state of model evaluations. Building on those insights, we lay out a long-term vision for NLG evaluation and propose concrete steps for researchers to improve their evaluation processes. Finally, we analyze how well 66 NLG papers from recent NLP conferences already follow these suggestions and identify which areas require more drastic changes to the status quo.