Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume 2021
DOI: 10.18653/v1/2021.eacl-main.29
A Systematic Review of Reproducibility Research in Natural Language Processing

Abstract: Against the background of what has been termed a reproducibility crisis in science, the NLP field is becoming increasingly interested in, and conscientious about, the reproducibility of its results. The past few years have seen an impressive range of new initiatives, events and active research in the area. However, the field is far from reaching a consensus about how reproducibility should be defined, measured and addressed, with diversity of views currently increasing rather than converging. With this focused…

Cited by 77 publications (58 citation statements) | References 35 publications
“…Our contributions are threefold. We first complete a reproduction of state-of-the-art cross-topic stance detection work (Reimers et al., 2019), as reproduction has repeatedly been shown to be relevant for NLP (Fokkens et al., 2013; Cohen et al., 2018; Belz et al., 2021). The reproduction is largely successful: we obtain similar numeric results.…”
Section: Introduction
confidence: 91%
“…We adopt the definition of reproduction by Belz et al. (2021): repeating the experiments as described in the earlier study, with the exact same data and software. We analyze our reproduced results according to the three dimensions of reproduction proposed by Cohen et al. (2018): whether we find either the same or different (1) (numeric) values, (2) findings, and (3) conclusions as the earlier study.…”
Section: Generalization To New Topics
confidence: 99%
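The first two of Cohen et al.'s (2018) dimensions lend themselves to a mechanical check once both sets of scores are available. The sketch below is a minimal illustration of such a check; the function name, the tolerance, and the example scores are assumptions made for illustration, not taken from any of the cited papers.

def compare_reproduction(original, reproduced, tol=0.01):
    """Compare original and reproduced metric scores along the first two
    dimensions of Cohen et al. (2018); conclusions need human judgment."""
    # Dimension 1 (values): are the numbers the same within a tolerance?
    same_values = all(abs(original[m] - reproduced[m]) <= tol for m in original)
    # Dimension 2 (findings): approximated here as whether the relative
    # ordering of the reported scores is preserved (e.g., the best-scoring
    # setting stays the best); real findings can be richer than this.
    same_findings = (sorted(original, key=original.get)
                     == sorted(reproduced, key=reproduced.get))
    return {"same_values": same_values,
            "same_findings": same_findings,
            # Dimension 3 (conclusions): what the findings imply for the
            # paper's claims; this cannot be automated.
            "same_conclusions": "requires manual assessment"}

# Hypothetical F1 scores per topic from an original run and a reproduction.
print(compare_reproduction({"topic_A": 0.71, "topic_B": 0.64},
                           {"topic_A": 0.70, "topic_B": 0.65}))

On these made-up numbers the check reports same values (within tolerance) and same findings, the kind of outcome the citing authors describe as a largely successful reproduction.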
“…In addition, raw annotations can shed light on the difficulty of the task and the nature of the data: they can be aggregated in multiple ways (Oortwijn et al., 2021), or used to account for annotator bias in model training (Beigman and Beigman Klebanov, 2009). Finally, releasing annotated judgments makes it possible to replicate and further analyze the evaluation outcome (Belz et al., 2021).…”
Section: Releasing Annotations
confidence: 99%
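To make the point about aggregating raw annotations in multiple ways concrete, the following is a small hypothetical sketch contrasting two aggregation strategies; the item names, labels, and functions are invented for illustration and do not come from the cited work.

from collections import Counter

def majority_vote(labels):
    """Aggregate by taking the most frequent label (ties broken arbitrarily)."""
    return Counter(labels).most_common(1)[0][0]

def unanimous_only(labels):
    """Stricter aggregation: keep an item only when all annotators agree."""
    return labels[0] if len(set(labels)) == 1 else None

# Hypothetical raw judgments from three annotators per item.
raw = {"item_1": ["favor", "favor", "against"],
       "item_2": ["against", "against", "against"]}

print({k: majority_vote(v) for k, v in raw.items()})   # {'item_1': 'favor', 'item_2': 'against'}
print({k: unanimous_only(v) for k, v in raw.items()})  # {'item_1': None, 'item_2': 'against'}

Because the two strategies yield different datasets (the stricter one drops item_1 entirely), releasing the raw judgments rather than a single aggregation is what makes replication and re-analysis possible.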
“…For ST (style transfer), the lack of detail and clarity in describing evaluation protocols makes it difficult to improve them, as has been pointed out for other NLG tasks by Shimorina and Belz (2021), who propose evaluation datasheets for clear documentation of human evaluations, Lee (2020) and van der Lee et al., who propose best-practices guidelines, and Belz et al. (2021), who raise concerns regarding reproducibility. This issue is particularly salient for ST tasks, where stylistic changes are defined implicitly by data (Jin et al., 2021) and where the instructions given to human judges for style transfer might be the only explicit characterization of the style dimension targeted.…”
Section: Standardizing Evaluation Protocols
confidence: 99%