Interspeech 2021
DOI: 10.21437/interspeech.2021-1171

Layer Pruning on Demand with Intermediate CTC

Abstract: Deploying an end-to-end automatic speech recognition (ASR) model on mobile/embedded devices is a challenging task, since the device's computational power and energy consumption requirements change dynamically in practice. To overcome this issue, we present a training and pruning method for ASR based on connectionist temporal classification (CTC) which allows the model depth to be reduced at run time without any extra fine-tuning. To achieve this goal, we adopt two regularization methods, intermediate CTC and s…
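The abstract's core idea, attaching an auxiliary CTC loss to an intermediate encoder layer so the upper layers can be dropped at run time, can be illustrated with a minimal PyTorch-style sketch. The module names, the 12-layer stack with a tap at layer 6, and the loss weight `lambda_inter` below are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal PyTorch-style sketch: an encoder trained with both a final and an
# intermediate CTC loss, so the top of the stack can be pruned at run time
# without fine-tuning. Layer counts, the tap position, and lambda_inter are
# illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn

class PrunableCTCEncoder(nn.Module):
    def __init__(self, num_layers=12, d_model=256, vocab_size=100, tap_layer=6):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
             for _ in range(num_layers)])
        self.tap_layer = tap_layer                       # intermediate CTC branch
        self.ctc_head = nn.Linear(d_model, vocab_size)   # shared CTC projection

    def forward(self, x, keep_layers=None):
        """keep_layers < num_layers drops the upper layers at inference time."""
        out = {}
        n = keep_layers or len(self.layers)
        for i, layer in enumerate(self.layers[:n]):
            x = layer(x)
            if (i + 1) == self.tap_layer:
                out["inter"] = self.ctc_head(x).log_softmax(-1)
        out["final"] = self.ctc_head(x).log_softmax(-1)
        return out

# Training objective: weighted sum of the final and intermediate CTC losses.
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

def loss_fn(out, targets, input_lens, target_lens, lambda_inter=0.3):
    final = ctc(out["final"].transpose(0, 1), targets, input_lens, target_lens)
    inter = ctc(out["inter"].transpose(0, 1), targets, input_lens, target_lens)
    return (1 - lambda_inter) * final + lambda_inter * inter

# On a weaker device: encoder(x, keep_layers=6)["final"] feeds a standard
# CTC decoder, reusing the shared head with no extra fine-tuning.
```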

Cited by 7 publications (5 citation statements). References 28 publications.
“…Prior research for Transformer based speech processing models has largely evolved into two categories: 1) architecture compression methods that aim to minimize the Transformer model structural redundancy measured by their depth, width, sparsity, or their combinations using techniques such as pruning [8][9][10], low-rank matrix factorization [11,12] and distillation [13,14]; and 2) low-bit quantization approaches that use either uniform [15][16][17][18], or mixed precision [12,19] settings. A combination of both architecture compression and low-bit quantization approaches has also been studied to produce larger model compression ratios [12].…”
Section: Introduction
confidence: 99%
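As a concrete instance of one technique named in the quotation above, the sketch below applies low-rank matrix factorization to a trained linear layer via truncated SVD. The rank and layer shapes are illustrative, and the cited works may factorize different components or combine this with pruning and quantization.

```python
# Hedged sketch of low-rank matrix factorization: replace a trained linear
# layer's weight W (out x in) with B (out x r) @ A (r x in) via truncated SVD.
# The rank and layer size below are illustrative assumptions.
import torch
import torch.nn as nn

def factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    W = layer.weight.data                        # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = torch.diag(S[:rank]) @ Vh[:rank]         # (rank, in_features)
    B = U[:, :rank]                              # (out_features, rank)
    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data.copy_(A)
    second.weight.data.copy_(B)
    if layer.bias is not None:
        second.bias.data.copy_(layer.bias.data)
    return nn.Sequential(first, second)

# Parameter count drops from out*in to r*(out+in): a 1024x1024 projection at
# rank 128 keeps roughly 25% of the original weights.
full = nn.Linear(1024, 1024)
compressed = factorize_linear(full, rank=128)
```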
“…The commonly adopted approach requires each target compressed system with the desired size to be individually constructed, for example, in [14,15,17] for Conformer models, and similarly for SSL foundation models such as DistilHuBERT [23], FitHuBERT [24], DPHuBERT [31], PARP [20], and LightHuBERT [30] (no more than 3 systems of varying complexity were built). 2) limited scope of system complexity attributes covering only a small subset of architecture hyper-parameters based on either network depth or width alone [8,9,11,35,36], or both [10,13,14,37], while leaving out the task of low-bit quantization, or vice versa [15][16][17][18][19][32][33][34]. This is particularly the case with the recent HuBERT model distillation research [23][24][25][28][29][30][31] that are focused on architectural compression alone.…”
Section: Introduction
confidence: 99%
“…However, it comes at the cost of higher computational resources and memory consumption. It is hypothesized that some of the many layers might be redundant and have little contribution to the overall system performance [26]. This motivates the present study to inspect the redundancy among layers and perform layer-level structured pruning, i.e., layer pruning (LP), for simplifying deep models.…”
Section: Introduction
confidence: 99%
“…Inspired by [26], if two layers' outputs are similar, the layers between them are assumed to be redundant and can be discarded. In the present study, we propose the Correlation Measure based Fast Search on Layer Pruning (CoMFLP).…”
Section: Introduction
confidence: 99%
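The redundancy criterion described in the quotation above can be sketched as follows: score the similarity of two layers' outputs on some probe data and treat the span between highly correlated layers as a pruning candidate. The Pearson-style correlation over flattened hidden states used here is an illustrative stand-in, not necessarily CoMFLP's exact measure.

```python
# Hedged sketch of the quoted redundancy idea: if layer i's and layer j's
# outputs correlate strongly, the layers between them are pruning candidates.
# The similarity measure is an illustrative stand-in, not CoMFLP's exact metric.
import torch

def layer_output_similarity(h_i: torch.Tensor, h_j: torch.Tensor) -> float:
    """h_i, h_j: hidden states of shape (batch, time, dim) from layers i and j."""
    a = h_i.reshape(-1).float()
    b = h_j.reshape(-1).float()
    a = a - a.mean()
    b = b - b.mean()
    return float((a @ b) / (a.norm() * b.norm() + 1e-8))

def redundant_spans(hidden_states, threshold=0.95):
    """Return (i, j) pairs whose outputs correlate above the threshold."""
    pairs = []
    for i in range(len(hidden_states)):
        for j in range(i + 1, len(hidden_states)):
            if layer_output_similarity(hidden_states[i], hidden_states[j]) > threshold:
                pairs.append((i, j))   # layers i+1 .. j are candidates to discard
    return pairs
```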