Knowledge distillation (KD) is a powerful and widely applicable technique for the compression of deep learning models. The main idea of knowledge distillation is to transfer knowledge from a large teacher model to a small student model, where the attention mechanism has been intensively explored in regard to its great flexibility for managing different teacher-student architectures. However, existing attention-based methods usually transfer similar attention knowledge from the intermediate layers of deep neural networks, leaving the hierarchical structure of deep representation learning poorly investigated for knowledge distillation. In this paper, we propose a hierarchical multi-attention transfer framework (HMAT), where different types of attention are utilized to transfer the knowledge at different levels of deep representation learning for knowledge distillation. Specifically, position-based and channel-based attention knowledge characterize the knowledge from low-level and high-level feature representations respectively, and activation-based attention knowledge characterize the knowledge from both mid-level and high-level feature representations. Extensive experiments on three popular visual recognition tasks, image classification, image retrieval, and object detection, demonstrate that the proposed hierarchical multi-attention transfer or HMAT significantly outperforms recent state-of-the-art KD methods.
Knowledge distillation (KD), as an efficient and effective model compression technique, has been receiving considerable attention in deep learning. The key to its success is to transfer knowledge from a large teacher network to a small student one. However, most of the existing knowledge distillation methods consider only one type of knowledge learned from either instance features or instance relations via a specific distillation strategy in teacher-student learning. There are few works that explore the idea of transferring different types of knowledge with different distillation strategies in a unified framework. Moreover, the frequently used offline distillation suffers from a limited learning capacity due to the fixed teacher-student architecture. In this paper we propose a collaborative teacher-student learning via multiple knowledge transfer (CTSL-MKT) that prompts both self-learning and collaborative learning. It allows multiple students learn knowledge from both individual instances and instance relations in a collaborative way. While learning from themselves with self-distillation, they can also guide each other via online distillation. The experiments and ablation studies on four image datasets demonstrate that the proposed CTSL-MKT significantly outperforms the state-of-the-art KD methods.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.