2021
DOI: 10.48550/arxiv.2107.08039
Preprint

Representation Consolidation for Training Expert Students

Zhizhong Li, Avinash Ravichandran, Charless Fowlkes, et al.

Abstract: Traditionally, distillation has been used to train a student model to emulate the input/output functionality of a teacher. A more useful goal than emulation, yet under-explored, is for the student to learn feature representations that transfer well to future tasks. However, we observe that standard distillation of task-specific teachers actually reduces the transferability of student representations to downstream tasks. We show that a multi-head, multi-task distillation method using an unlabeled proxy dataset …
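
The abstract's multi-head, multi-task distillation can be pictured roughly as follows: a shared student backbone with one head per frozen task-specific teacher is trained on unlabeled proxy images to match each teacher's soft outputs. The PyTorch sketch below is a minimal illustration under those assumptions; the names StudentWithHeads, kd_loss, and distill_step are invented here, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of multi-head, multi-task distillation:
# a shared student backbone with one linear head per frozen task-specific
# teacher, trained to match each teacher's soft outputs on unlabeled proxy
# images. All module and function names are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StudentWithHeads(nn.Module):
    """Shared backbone plus one classification head per teacher/task."""

    def __init__(self, num_classes_per_task, feat_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, c) for c in num_classes_per_task]
        )

    def forward(self, x):
        feats = self.backbone(x)                  # shared representation
        return [head(feats) for head in self.heads]


def kd_loss(student_logits, teacher_logits, T=4.0):
    """Temperature-scaled KL divergence between soft output distributions."""
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)


def distill_step(student, teachers, proxy_images, optimizer):
    """One update: match every frozen teacher's outputs on unlabeled proxy data."""
    student_outputs = student(proxy_images)
    with torch.no_grad():
        teacher_outputs = [t(proxy_images) for t in teachers]
    loss = sum(kd_loss(s, t) for s, t in zip(student_outputs, teacher_outputs))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Giving each teacher its own head lets the shared backbone consolidate knowledge from all teachers, while the task-specific decision layers stay separate from the representation being transferred.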

Cited by 1 publication (2 citation statements)
References 27 publications
“…[63] addresses interference by de-conflicting gradients via projection. [35, 36] use distillation to avoid interference, but they are limited to a restricted setting, either single-task multi-source or single-source multi-task. Other works attempt to develop systematic techniques to determine which tasks should be trained together in a multi-task neural network to avoid harmful conflicts between non-affinitive tasks [1-3, 17, 34].…”
Section: Related Work
Mentioning (confidence: 99%)
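
The gradient de-conflicting mentioned in this citation ([63]) can be illustrated with a small sketch: when two task gradients conflict (negative dot product), the component of one that points against the other is projected away. This is a PCGrad-style illustration of the general idea, not necessarily the exact rule of [63]; the function name deconflict is an assumption.

```python
# Minimal sketch of de-conflicting two task gradients by projection
# (PCGrad-style illustration; not necessarily the exact rule of [63]).
import torch


def deconflict(g_i, g_j, eps=1e-12):
    """If flattened gradient g_i conflicts with g_j (negative dot product),
    remove the component of g_i that points against g_j."""
    dot = torch.dot(g_i, g_j)
    if dot < 0:
        g_i = g_i - (dot / (g_j.norm() ** 2 + eps)) * g_j
    return g_i


# Toy usage: dot(g_a, g_b) = -1 < 0, so the conflicting component is removed
# and the result is tensor([1.5, -1.5]).
g_a = torch.tensor([1.0, -2.0])
g_b = torch.tensor([1.0, 1.0])
g_a_fixed = deconflict(g_a, g_b)
```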
“…X-Learner++. Inspired by [36], in the Expansion Stage we add extra supervision from single-task single-source pre-trained models in the form of hints, besides the original supervision from labels of multiple data sources. This can be viewed as adding a pre-distillation process with multiple SSST teachers prior to training the expanded backbone.…”
Section: Variants of X-Learner
Mentioning (confidence: 99%)
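
The hint-based supervision described in this citation can be sketched as follows: alongside the usual label loss, the student's shared features are pushed, through a small adapter layer, toward a frozen single-source single-task (SSST) teacher's features. The code below is a hedged sketch under those assumptions; hint_step, adapter, and hint_weight are illustrative names, not the X-Learner++ implementation.

```python
# Minimal sketch (assumptions, not the X-Learner++ code) of adding feature
# "hints" from a frozen SSST teacher on top of the label loss. An adapter
# maps the shared student features into the teacher's feature space before
# an L2 hint loss is applied.
import torch
import torch.nn.functional as F


def hint_step(backbone, head, adapter, teacher, images, labels,
              optimizer, hint_weight=1.0):
    """One update on one task's batch: cross-entropy on labels plus an L2
    hint toward the frozen SSST teacher's features."""
    feats = backbone(images)                       # shared student features
    label_loss = F.cross_entropy(head(feats), labels)
    with torch.no_grad():
        teacher_feats = teacher(images)            # frozen pre-trained teacher
    hint_loss = F.mse_loss(adapter(feats), teacher_feats)
    loss = label_loss + hint_weight * hint_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```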