2021
DOI: 10.1609/aaai.v35i15.17610

ALP-KD: Attention-Based Layer Projection for Knowledge Distillation

Abstract: Knowledge distillation is considered as a training and compression strategy in which two neural networks, namely a teacher and a student, are coupled together during training. The teacher network is supposed to be a trustworthy predictor and the student tries to mimic its predictions. Usually, a student with a lighter architecture is selected so we can achieve compression and yet deliver high-quality results. In such a setting, distillation only happens for final predictions whereas the student could also bene…

Cited by 64 publications (24 citation statements)
References 17 publications
“…The original knowledge is the original large DL model, which is referred to as the teacher model. The knowledge distillation algorithm is used to transfer knowledge from the teacher model to the smaller student model using techniques such as Adversarial KD [130,131], Multi-Teacher KD [132,133,134], Cross-modal KD [135,136], Attention-based KD [137,138,139,140], Lifelong KD [141,142] and Quantized KD [143,144]. Finally, the teacher-student architecture is used to train the student model.…”
Section: C) Knowledge Distillation
mentioning confidence: 99%
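As a rough illustration of the teacher-to-student transfer described in the statement above, here is a minimal sketch of logit-level distillation. The temperature, loss weighting, and model interfaces are placeholder assumptions for illustration, not details taken from any of the cited works.

```python
# Minimal teacher-student distillation loss (hypothetical settings, PyTorch).
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with a KL term on temperature-softened logits."""
    # Hard-label loss on the downstream task.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-label loss: the student mimics the teacher's output distribution.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * ce + (1.0 - alpha) * kd
```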
“…Recently, there have been several breakthroughs [10,11,12,13] related to the compression of BERT models in the pre-training stage, which is also called task-agnostic distillation [13]. To avoid re-building a pre-trained language model, researchers [14,15] are seeking an alternative that can directly distill knowledge from a teacher model for a downstream task, known as task-specific distillation [13]. In this way, given a downstream task, the teacher is the BERT model that was fine-tuned on the task, and the goal of the student model is to mimic the outputs of the teacher during the given task.…”
Section: Introduction
mentioning confidence: 99%
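A sketch of the task-specific setting the statement describes: the teacher is a model already fine-tuned on the downstream task and kept frozen, and the student is trained to mimic its outputs on that task. The HuggingFace-style `.logits` interface and the names `teacher`, `student`, and `batch` are assumptions for illustration.

```python
# Hypothetical task-specific distillation step with a frozen, fine-tuned teacher.
import torch
import torch.nn.functional as F

def task_specific_step(teacher, student, optimizer, batch, T=2.0):
    teacher.eval()
    with torch.no_grad():                                  # frozen, fine-tuned teacher
        t_logits = teacher(batch["input_ids"]).logits
    s_logits = student(batch["input_ids"]).logits
    # The student mimics the teacher's softened output distribution on the task.
    loss = F.kl_div(
        F.log_softmax(s_logits / T, dim=-1),
        F.softmax(t_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```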
“…To fix this problem in PKD [14], Passban et al. [15] proposed Attention-Based Layer Projection for Knowledge Distillation (ALP-KD), which, instead of skipping some teacher layers, optimizes the student model against all layers of the teacher model. However, each layer in BERT [1] plays a role in the NLP pipeline [16].…”
Section: Introduction
mentioning confidence: 99%
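To make the attention-based layer projection idea concrete, the following is a minimal sketch of how a single student layer could attend over all teacher layers and be pulled toward the resulting combination. The dot-product scoring, tensor shapes, and MSE objective are my assumptions for illustration, not the authors' exact formulation.

```python
# Sketch of attention-based layer projection (hypothetical shapes, PyTorch).
import torch
import torch.nn.functional as F

def alp_kd_loss(student_hidden, teacher_hiddens):
    """
    student_hidden:  (batch, seq, dim)                 one student layer's output
    teacher_hiddens: (layers, batch, seq, dim)         all teacher layers' outputs
    """
    # Score each teacher layer against the student layer (scaled dot product over dim).
    scores = torch.einsum("bsd,lbsd->lbs", student_hidden, teacher_hiddens)
    scores = scores / student_hidden.size(-1) ** 0.5
    weights = F.softmax(scores, dim=0)                  # attention over teacher layers
    # Attention-weighted projection of all teacher layers, so no layer is skipped.
    projected = torch.einsum("lbs,lbsd->bsd", weights, teacher_hiddens)
    # Pull the student layer toward the projected teacher representation.
    return F.mse_loss(student_hidden, projected)
```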