Deep Dominance - How to Properly Compare Deep Neural Models

Dror, Rotem; Shlomov, Segev; Reichart, Roi

doi:10.18653/v1/p19-1266

Cited by 73 publications

(61 citation statements)

References 21 publications

“…Deep neural networks' performance on NLP tasks is bound to exhibit large variance. Reimers and Gurevych (2017) and Dror et al (2019) stress the importance of reporting score distributions instead of a single score for fair(er) comparisons. Dodge et al (2020), Mosbach et al (2021), and Zhang et al (2021) show that finetuning pretrained encoders with different random seeds yields performance with large variance.…”

Section: Background and Related Work (mentioning)

confidence: 99%

A Closer Look at Few-Shot Crosslingual Transfer: The Choice of Shots Matters

Zhao, Zhu, Shareghi et al. 2021

Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Confer

Self Cite

Few-shot crosslingual transfer has been shown to outperform its zero-shot counterpart with pretrained encoders like multilingual BERT. Despite its growing popularity, little to no attention has been paid to standardizing and analyzing the design of few-shot experiments. In this work, we highlight a fundamental risk posed by this shortcoming, illustrating that the model exhibits a high degree of sensitivity to the selection of few shots. We conduct a large-scale experimental study on 40 sets of sampled few shots for six diverse NLP tasks across up to 40 languages. We provide an analysis of success and failure cases of few-shot transfer, which highlights the role of lexical features. Additionally, we show that a straightforward full model finetuning approach is quite effective for few-shot transfer, outperforming several state-of-the-art few-shot approaches. As a step towards standardizing few-shot crosslingual experimental designs, we make our sampled few shots publicly available.
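The sensitivity described above (to random seeds or to the choice of sampled shots) is the reason for reporting a score distribution rather than a single number. A minimal sketch of such a summary follows; the function name `summarize_scores` and the accuracy values are hypothetical and are not taken from any of the cited papers.

```python
import statistics


def summarize_scores(scores):
    """Summarize scores from repeated runs (e.g. different seeds or shot sets)."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return {
        "n": len(scores),
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores),  # sample standard deviation
        "min": min(scores),
        "max": max(scores),
    }


# Hypothetical accuracies from five runs; illustrative values only.
runs = [0.71, 0.64, 0.69, 0.58, 0.73]
summary = summarize_scores(runs)
print(summary)
```

Reporting the spread alongside the mean shows whether a gap between two models is large relative to run-to-run variation. It is a descriptive summary only and does not replace a significance test.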

“…The results show that our structural KD approaches outperform the baselines in all the cases. Table 3 Dror et al (2019) with a significance level of 0.05 and find that the advantages of our structural KD approaches are significant. Please refer to Appendix for more detailed results.…”

Section: Results (mentioning)

confidence: 95%

“…In this section, we present detailed experimental results. (Dror et al, 2019), which is a high quality comparison between deep neural networks. We evaluate with a significance level of 0.05.…”

Section: Detailed Experimental Results (mentioning)

confidence: 99%

Structural Knowledge Distillation: Tractably Distilling Information for Structured Predictor

Wang, Jiang, Yan et al. 2021

Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Confer

Knowledge distillation is a critical technique to transfer knowledge between models, typically from a large model (the teacher) to a more fine-grained one (the student). The objective function of knowledge distillation is typically the cross-entropy between the teacher and the student's output distributions. However, for structured prediction problems, the output space is exponential in size; therefore, the cross-entropy objective becomes intractable to compute and optimize directly. In this paper, we derive a factorized form of the knowledge distillation objective for structured prediction, which is tractable for many typical choices of the teacher and student models. In particular, we show the tractability and empirical effectiveness of structural knowledge distillation between sequence labeling and dependency parsing models under four different scenarios: 1) the teacher and student share the same factorization form of the output structure scoring function; 2) the student factorization produces more fine-grained substructures than the teacher factorization; 3) the teacher factorization produces more fine-grained substructures than the student factorization; 4) the factorization forms from the teacher and the student are incompatible.
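The cross-entropy objective and the exponential output space mentioned in the abstract can be illustrated with a small sketch. This is not the paper's factorized objective; it only shows the naive cross-entropy over an explicitly enumerated output space and why that enumeration does not scale for sequence labeling. The function names and the example numbers are hypothetical.

```python
import math


def kd_cross_entropy(teacher_probs, student_probs):
    """Naive cross-entropy H(teacher, student) = -sum_y p_t(y) * log p_s(y).

    Both arguments list probabilities over the same enumerated outputs.
    Student probabilities are assumed to be strictly positive.
    """
    return -sum(
        p_t * math.log(p_s)
        for p_t, p_s in zip(teacher_probs, student_probs)
        if p_t > 0  # terms with zero teacher mass contribute nothing
    )


def num_label_sequences(num_labels, length):
    """Number of distinct label sequences for a sentence of a given length."""
    return num_labels ** length


# Toy output space with three outputs (illustrative values only).
teacher = [0.7, 0.2, 0.1]
student = [0.6, 0.3, 0.1]
print(kd_cross_entropy(teacher, student))

# With 10 labels and a 20-token sentence, the sum above would have 10**20 terms.
print(num_label_sequences(10, 20))
```

The second call shows why the sum cannot be evaluated term by term for structured outputs, which is the motivation the abstract gives for a factorized form.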

“…We also notice that the accuracy increment is relatively higher for all experiments on the WOS corpus than on DBpedia. A primary reason might be the number of documents in each dataset, as (Dror et al, 2019) over the seq2seq baseline with a significance level of 0.05. The amount of parameters of each combined strategies is up to seven million.…”

Section: Results (mentioning)

confidence: 99%