Yimeng Wu scite author profile

Yimeng Wu

5 Publications

43 Citation Statements Received

62 Citation Statements Given

Affiliations

Shanghai Tongji Urban Planning and Design Institute, Nanjing Forestry University, Huawei Technologies (Sweden)

Publications

ALP-KD: Attention-Based Layer Projection for Knowledge Distillation

Passban

Rezagholizadeh

et al. 2021

AAAI

Knowledge distillation is considered as a training and compression strategy in which two neural networks, namely a teacher and a student, are coupled together during training. The teacher network is supposed to be a trustworthy predictor and the student tries to mimic its predictions. Usually, a student with a lighter architecture is selected so we can achieve compression and yet deliver high-quality results. In such a setting, distillation only happens for final predictions whereas the student could also benefit from teacher’s supervision for internal components. Motivated by this, we studied the problem of distillation for intermediate layers. Since there might not be a one-to-one alignment between student and teacher layers, existing techniques skip some teacher layers and only distill from a subset of them. This shortcoming directly impacts quality, so we instead propose a combinatorial technique which relies on attention. Our model fuses teacher-side information and takes each layer’s significance into consideration, then it performs distillation between combined teacher layers and those of the student. Using our technique, we distilled a 12-layer BERT (Devlin et al. 2019) into 6-, 4-, and 2-layer counterparts and evaluated them on GLUE tasks (Wang et al. 2018). Experimental results show that our combinatorial approach is able to outperform other existing techniques.
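
The fusion step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the dot-product scoring, the mean-squared-error loss, the assumption that student and teacher vectors share one dimension, and all function names (`fuse_teacher_layers`, `intermediate_loss`) are illustrative choices that the abstract does not specify.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_teacher_layers(student_vec, teacher_vecs):
    """Combine all teacher layers into one vector for a given student layer.

    Assumption: attention scores are plain dot products between the
    student vector and each teacher vector.
    """
    weights = softmax([dot(student_vec, t) for t in teacher_vecs])
    dim = len(student_vec)
    fused = [sum(w * t[i] for w, t in zip(weights, teacher_vecs))
             for i in range(dim)]
    return fused, weights


def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def intermediate_loss(student_layers, teacher_layers):
    """Average distance between each student layer and its fused teacher target."""
    total = 0.0
    for s in student_layers:
        fused, _ = fuse_teacher_layers(s, teacher_layers)
        total += mse(s, fused)
    return total / len(student_layers)
```

In this sketch no teacher layer is skipped: every teacher layer contributes to every student layer's target, with a weight that reflects how closely it matches that student layer.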

Why Skip If You Can Combine: A Simple Knowledge Distillation Technique for Intermediate Layers

Passban

Rezagholizadeh

et al. 2020

With the growth of computing power neural machine translation (NMT) models also grow accordingly and become better. However, they also become harder to deploy on edge devices due to memory constraints. To cope with this problem, a common practice is to distill knowledge from a large and accurately-trained teacher network (T) into a compact student network (S). Although knowledge distillation (KD) is useful in most cases, our study shows that existing KD techniques might not be suitable enough for deep NMT engines, so we propose a novel alternative. In our model, besides matching T and S predictions we have a combinatorial mechanism to inject layer-level supervision from T to S. In this paper, we target low-resource settings and evaluate our translation engines for Portuguese→English, Turkish→English, and English→German directions. Students trained using our technique have 50% fewer parameters and can still deliver comparable results to those of 12-layer teachers.
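
One way to picture a two-part objective of this kind (prediction matching plus layer-level supervision) is the sketch below. It is an assumption-laden illustration, not the paper's method: the KL-divergence prediction term, the mean over consecutive groups of teacher layers, the weighting factor `beta`, and every function name are hypothetical, since the abstract does not state how teacher layers are combined.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def combine_group(layers):
    """Assumed combination rule: element-wise mean of a group of layer vectors."""
    n = len(layers)
    return [sum(vals) / n for vals in zip(*layers)]


def layer_loss(student_layers, teacher_layers):
    """Match each student layer to the mean of its group of teacher layers.

    Assumes the teacher depth is a multiple of the student depth.
    """
    group = len(teacher_layers) // len(student_layers)
    total = 0.0
    for i, s in enumerate(student_layers):
        target = combine_group(teacher_layers[i * group:(i + 1) * group])
        total += sum((a - b) ** 2 for a, b in zip(s, target)) / len(s)
    return total / len(student_layers)


def distillation_loss(student_logits, teacher_logits,
                      student_layers, teacher_layers,
                      beta=1.0, temperature=1.0):
    """Prediction-matching term plus a weighted layer-level term."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    prediction_term = kl_divergence(p_teacher, p_student)
    return prediction_term + beta * layer_loss(student_layers, teacher_layers)
```

With a 12-layer teacher and a 6-layer student, this grouping would pair teacher layers two at a time, so each student layer receives supervision from both layers in its group rather than from only one of them.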

ALP-KD: Attention-Based Layer Projection for Knowledge Distillation

Passban¹,

Wu²,

Rezagholizadeh³

et al. 2020

Preprint

Airflow-induced nanochannel orientation in mesoporous polymers and carbon films

Lü

2015

Microporous and Mesoporous Materials

Why Skip If You Can Combine: A Simple Knowledge Distillation Technique for Intermediate Layers

Wu¹,

Passban

Rezagholizadeh³

et al. 2020

Preprint
