Statistical Significance Testing for Natural Language Processing

Dror, Rotem; Peled-Cohen, Lotem; Shlomov, Segev; Reichart, Roi

doi:10.2200/s00994ed1v01y202002hlt045

Cited by 18 publications

(14 citation statements)

References 80 publications


“…Notably, these improvements arise from training on merely three relations, meaning that the model improved its consistency ability and generalized to new relations. We measure the statistical significance of our method compared to the BERT baseline, using McNemar's test (following Dror et al., 2018, 2020) and find all results to be significant (p < 0.01). We also perform an ablation study to quantify the utility of the different components.…”

Section: Improved Consistency Results (mentioning)

confidence: 99%

Measuring and Improving Consistency in Pretrained Language Models

Elazar, Kassner, Ravfogel et al. 2021

Transactions of the Association for Computational Linguistics


Consistency of a model, that is, the invariance of its behavior under meaning-preserving alternations in its input, is a highly desirable property in natural language processing. In this paper we study the question: Are Pretrained Language Models (PLMs) consistent with respect to factual knowledge? To this end, we create ParaRel🤘, a high-quality resource of cloze-style query English paraphrases. It contains a total of 328 paraphrases for 38 relations. Using ParaRel🤘, we show that the consistency of all PLMs we experiment with is poor, though with high variance between relations. Our analysis of the representational spaces of PLMs suggests that they have a poor structure and are currently not suitable for representing knowledge robustly. Finally, we propose a method for improving model consistency and experimentally demonstrate its effectiveness.
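The citation statement above reports a McNemar's test between a proposed method and a BERT baseline. As a rough illustration of what such a comparison involves, here is a minimal sketch of the exact (binomial) variant of McNemar's test on paired predictions. The function name `mcnemar_exact` and the toy labels are hypothetical, and this is not the citing authors' code; their exact variant and setup are not given in the statement.

```python
from math import comb


def mcnemar_exact(gold, pred_a, pred_b):
    """Exact two-sided McNemar's test for two systems on the same test items.

    Only discordant items matter: those where exactly one system is correct.
    Under the null hypothesis, each discordant item is equally likely to
    favor either system, so the counts follow Binomial(n, 0.5).
    Returns (only_a, only_b, p_value).
    """
    if not (len(gold) == len(pred_a) == len(pred_b)):
        raise ValueError("gold, pred_a and pred_b must have the same length")

    only_a = only_b = 0
    for g, a, b in zip(gold, pred_a, pred_b):
        a_ok, b_ok = (a == g), (b == g)
        if a_ok and not b_ok:
            only_a += 1
        elif b_ok and not a_ok:
            only_b += 1

    n = only_a + only_b
    if n == 0:
        # No discordant items: no evidence of a difference.
        return only_a, only_b, 1.0

    k = min(only_a, only_b)
    # Probability of a split at least as extreme as the observed one (one tail).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return only_a, only_b, min(1.0, 2 * tail)


# Toy example (made-up labels): system A wins on 8 items, system B on 1.
gold = [1] * 10
pred_a = [1] * 9 + [0]
pred_b = [0] * 8 + [1, 1]
only_a, only_b, p_value = mcnemar_exact(gold, pred_a, pred_b)
print(only_a, only_b, p_value)
```

The exact form is used here because it needs no external library and behaves sensibly when there are few discordant items; for large test sets the chi-squared approximation is a common alternative.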


“…In order to show that the results are not coincidental, we test the statistical significance of our model. We follow the nonparametric Pitman's permutation test (Dror et al, 2018) and observe that our model is statistically significant when the significance level (α) is taken to be 0.05. Note that this holds true for all metric on both the datasets except ROUGE-2 on ParaNMT-small.…”

Section: Semantic Preservation And (mentioning)

confidence: 99%

Syntax-Guided Controlled Generation of Paraphrases

Kumar, Ahuja, Vadapalli et al. 2020

Transactions of the Association for Computational Linguistics


Given a sentence (e.g., “I like mangoes”) and a constraint (e.g., sentiment flip), the goal of controlled text generation is to produce a sentence that adapts the input sentence to meet the requirements of the constraint (e.g., “I hate mangoes”). Going beyond such simple constraints, recent work has started exploring the incorporation of complex syntactic guidance as constraints in the task of controlled paraphrase generation. In these methods, syntactic guidance is sourced from a separate exemplar sentence. However, these prior works have only utilized limited syntactic information available in the parse tree of the exemplar sentence. We address this limitation in the paper and propose Syntax Guided Controlled Paraphraser (SGCP), an end-to-end framework for syntactic paraphrase generation. We find that SGCP can generate syntax-conforming sentences while not compromising on relevance. We perform extensive automated and human evaluations over multiple real-world English language datasets to demonstrate the efficacy of SGCP over state-of-the-art baselines. To drive future research, we have made SGCP’s source code available.
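The citation statement for this paper refers to a nonparametric paired permutation test at α = 0.05. The sketch below shows one common Monte Carlo form of that test, which randomly flips the sign of each per-example score difference. The function name `paired_permutation_test`, its parameters, and the toy scores are hypothetical; the sketch also assumes the metric decomposes into per-example scores, which the statement does not specify for the metrics used there.

```python
import random


def paired_permutation_test(scores_a, scores_b, n_resamples=10000, seed=0):
    """Two-sided paired permutation (sign-flip) test on per-example scores.

    Null hypothesis: the two systems are exchangeable on every example, so
    swapping their scores (flipping the sign of the difference) is equally
    likely. Returns a Monte Carlo estimate of the p-value.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")

    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs)) / n  # observed absolute mean difference

    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    hits = 0
    for _ in range(n_resamples):
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        # Small tolerance guards against floating-point ties.
        if abs(total) / n >= observed - 1e-12:
            hits += 1

    # Add-one smoothing keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Toy example (made-up scores): system A beats system B on every example.
p_value = paired_permutation_test([1.0] * 20, [0.0] * 20, n_resamples=2000)
print(p_value < 0.05)
```

A result would be called significant at level α when the returned p-value falls below α. With many metrics or datasets, a correction for multiple comparisons may also be needed.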


“…Notably, these improvements arise from training on merely three relations, meaning that the model improved its consistency ability and generalized to new relations. We measure the statistical significance of our method compared to the BERT baseline, using McNemar's test (following Dror et al., 2018, 2020) and find all results to be significant (p-value < 0.01). We also perform an ablation study to quantify the utility of the different components.…”

Section: Improved Consistency Results (mentioning)

confidence: 99%

Measuring and Improving Consistency in Pretrained Language Models

Elazar, Kassner, Ravfogel et al. 2021

Preprint


Consistency of a model, that is, the invariance of its behavior under meaning-preserving alternations in its input, is a highly desirable property in natural language processing. In this paper we study the question: Are Pretrained Language Models (PLMs) consistent with respect to factual knowledge? To this end, we create PARAREL, a high-quality resource of cloze-style query English paraphrases. It contains a total of 328 paraphrases for thirty-eight relations. Using PARAREL, we show that the consistency of all PLMs we experiment with is poor, though with high variance between relations. Our analysis of the representational spaces of PLMs suggests that they have a poor structure and are currently not suitable for representing knowledge in a robust way. Finally, we propose a method for improving model consistency and experimentally demonstrate its effectiveness.
