Learning without Memorizing
Preprint, 2018
DOI: 10.48550/arxiv.1811.08051

Cited by 5 publications (7 citation statements). References 0 publications.
“…In [5] previous knowledge is distilled directly from the last trained model. In [39] an attention distillation loss is introduced as an information preserving penalty for the classifiers' attention maps. In [6] the current model distills knowledge from all previous model snapshots, of which a pruned version is saved.…”
Section: Related Work (mentioning, confidence: 99%)
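The attention-distillation penalty referenced above ([39], i.e. this paper) constrains the classifier's attention maps rather than only its output logits. Below is a minimal sketch, assuming activation-based attention maps and PyTorch-style tensors; the original work derives its maps via Grad-CAM, and the function names here are illustrative.

```python
import torch.nn.functional as F

def attention_map(features):
    # Collapse a conv feature tensor (B, C, H, W) into a per-sample spatial
    # attention map (B, H*W): sum squared activations over channels, then
    # L2-normalize so teacher and student maps are on a common scale.
    attn = features.pow(2).sum(dim=1).flatten(1)
    return F.normalize(attn, p=2, dim=1)

def attention_distillation_loss(student_features, teacher_features):
    # L1 distance between the normalized attention maps of the current
    # (student) model and the frozen previous (teacher) model.
    diff = attention_map(student_features) - attention_map(teacher_features)
    return diff.abs().sum(dim=1).mean()
```

In training, a penalty of this kind would be added to the ordinary classification loss, with the previous (teacher) model kept frozen and its features computed under torch.no_grad().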
“…Deep learning models are prone to catastrophic forgetting [20,30,48], i.e., training a model with new information interferes with previously learned knowledge and typically greatly degrades performance. This phenomenon has been widely studied in the image classification task, and most current techniques fall into the following categories [10,48]: regularization approaches [5,32,73,13,36], dynamic architectures [69,64,35], parameter isolation [17,53,40] and replay-based methods [66,46,55,26]. Regularization-based approaches are by far the most widely employed and mainly come in two flavours, i.e., penalty computing and knowledge distillation [25].…”
Section: Related Work (mentioning, confidence: 99%)
“…Penalty computing approaches [73,32] protect important weights inside the models to prevent forgetting. Knowledge distillation [52,66,36,13] relies on a teacher (old) model transferring knowledge related to previous tasks to a student model, which is also trained to learn additional tasks. Parameter isolation approaches [40,39] reserve a subset of weights for a specific task to avoid degradation.…”
Section: Related Work (mentioning, confidence: 99%)
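As a concrete illustration of the "penalty computing" flavour described in the excerpt above, here is a minimal sketch of an importance-weighted quadratic penalty in the spirit of EWC-style methods; the dictionaries of saved parameters and importance weights, and the lambda_reg hyperparameter, are assumptions for illustration.

```python
import torch

def importance_penalty(model, old_params, importances, strength=1.0):
    # Quadratic penalty discouraging parameters that were important for
    # previous tasks from drifting away from their saved values.
    # old_params / importances: dicts mapping parameter names to tensors
    # saved after the previous task (e.g., diagonal Fisher estimates).
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for name, param in model.named_parameters():
        if name in importances:
            penalty = penalty + (importances[name] * (param - old_params[name]) ** 2).sum()
    return strength * penalty

# Typical use while training on a new task (lambda_reg is a hyperparameter):
# loss = task_loss + importance_penalty(model, old_params, importances, strength=lambda_reg)
```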
“…Li et al. [17] propose Learning without Forgetting (LwF), which distills knowledge from the last model. Dhar et al. [4] introduce Grad-CAM [27] into the loss function. Rebuffi et al. [25] introduce an exemplar set for the old data and match previous logits through distillation.…”
Section: Related Work (mentioning, confidence: 99%)
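The logit-matching distillation used by LwF [17] and iCaRL [25], as described in the excerpt above, can be sketched as a temperature-scaled soft-target loss between the current model and the frozen previous model; the temperature value and function name below are illustrative.

```python
import torch.nn.functional as F

def logit_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soft-target distillation: KL divergence between the current model's
    # temperature-smoothed predictions and those of the frozen previous model.
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=1)
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
```

In practice only the logits of previously seen classes are matched this way; the new classes are trained with the ordinary cross-entropy loss.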
“…Treating the most recent previous model as the teacher and applying this distillation sequentially helps preserve historical information, especially when no previous exemplar set is stored, which is the protocol for prior methods [25,2,17,4]. However, historical information is gradually lost in this sequential pipeline, as the current model must reconstruct all the prior information from the penultimate model alone.…”
Section: Multi-model Distillation (mentioning, confidence: 99%)
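The contrast drawn in this excerpt, between sequentially distilling from only the penultimate model and distilling from all stored snapshots, can be sketched as follows; the snapshot handling and the distill_fn argument are assumptions for illustration, and in practice each snapshot's logits would be compared only over the classes that snapshot was trained on.

```python
import torch

def sequential_distillation_loss(student, last_teacher, inputs, distill_fn):
    # Distill only from the most recent previous model: knowledge of older
    # tasks survives only if it was re-encoded into each successive teacher.
    with torch.no_grad():
        teacher_logits = last_teacher(inputs)
    return distill_fn(student(inputs), teacher_logits)

def multi_snapshot_distillation_loss(student, teacher_snapshots, inputs, distill_fn):
    # Distill from every stored (possibly pruned) snapshot, so each past
    # task constrains the current model directly rather than indirectly.
    student_logits = student(inputs)
    losses = []
    for teacher in teacher_snapshots:
        with torch.no_grad():
            teacher_logits = teacher(inputs)
        # Here student_logits would normally be restricted to the classes
        # this particular snapshot knows before comparison.
        losses.append(distill_fn(student_logits, teacher_logits))
    return torch.stack(losses).mean()
```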