Flambé: A Customizable Framework for Machine Learning Experiments

Wohlwend, Jeremy; Matthews, Nicholas E.; Itzcovich, Ivan

doi:10.18653/v1/p19-3029

Cited by 2 publications

(1 citation statement)

References 12 publications


“…All models are trained using the Adam optimizer (Kingma and Ba, 2014) with an inverse-square-root learning rate scheduler and learning rate warmup (Vaswani et al., 2017). Our experiments were conducted using Flambé, a PyTorch-based model training and evaluation library (Wohlwend et al., 2019). More implementation details such as hyperparameter settings are provided in Appendix A.…”

Section: Methods (mentioning)

confidence: 99%
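
The citation statement above names an inverse-square-root learning rate schedule with warmup but does not give its formula. The sketch below shows one common form of that schedule: a linear ramp up to a peak rate, followed by decay proportional to 1/sqrt(step). The function name `inverse_sqrt_lr`, its parameters, and the linear shape of the warmup are assumptions made for illustration. They are not taken from the cited paper or from Flambé.

```python
import math


def inverse_sqrt_lr(step: int, peak_lr: float, warmup_steps: int) -> float:
    """Return the learning rate for a 1-indexed optimizer step.

    Hypothetical helper: assumes linear warmup to `peak_lr` over
    `warmup_steps`, then decay proportional to 1 / sqrt(step).
    """
    if step < 1 or warmup_steps < 1:
        raise ValueError("step and warmup_steps must be >= 1")
    if step <= warmup_steps:
        # Warmup phase: ramp linearly from peak_lr / warmup_steps to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: continuous with the warmup at step == warmup_steps.
    return peak_lr * math.sqrt(warmup_steps / step)


if __name__ == "__main__":
    # Print the rate at a few steps for an assumed 4000-step warmup.
    for s in (1000, 4000, 16000):
        print(s, inverse_sqrt_lr(s, peak_lr=1e-3, warmup_steps=4000))
```

In PyTorch, a function of this shape could be wrapped in `torch.optim.lr_scheduler.LambdaLR`, which expects a multiplier on the optimizer's base learning rate rather than an absolute rate.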

Autoregressive Knowledge Distillation through Imitation Learning

Lin, Wohlwend, Chen, et al. 2020

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Self Cite


The performance of autoregressive models on natural language generation tasks has dramatically improved due to the adoption of deep, self-attentive architectures. However, these gains have come at the cost of hindering inference speed, making state-of-the-art models cumbersome to deploy in real-world, time-sensitive settings. We develop a compression technique for autoregressive models that is driven by an imitation learning perspective on knowledge distillation. The algorithm is designed to address the exposure bias problem. On prototypical language generation tasks such as translation and summarization, our method consistently outperforms other distillation algorithms, such as sequence-level knowledge distillation. Student models trained with our method attain 1.4 to 4.8 BLEU/ROUGE points higher than those trained from scratch, while increasing inference speed by up to 14 times in comparison to the teacher model.

