Compressed Deep Networks: Goodbye SVD, Hello Robust Low-Rank Approximation

Tukan, Murad; Maalouf, Alaa; Weksler, Matan; Feldman, Dan

doi:10.48550/arxiv.2009.05647

Cited by 4 publications

(4 citation statements)

References 36 publications

“…Similarly to the original variant (the case of squared distance), we can use the EM method to obtain a "good guess" for a solution of this problem, however, the EM method requires an algorithm that solves the problem for the case of k = 1, i.e., computing the subspace that minimizes the sum of (non-squared) distances from those columns (for the sum of squared distances case, SVD is this algorithm). Unfortunately, there is only approximation algorithms (Clarkson and Woodruff, 2015; Tukan et al, 2020) for this case, and the deterministic versions are expensive in terms of running time.…”

Section: A2 Clustering Methods (mentioning)

confidence: 99%

Compressing Neural Networks: Towards Determining the Optimal Layer-wise Decomposition

Liebenwein, Maalouf, Oren et al. 2021

Preprint

Self Cite

We present a novel global compression framework for deep neural networks that automatically analyzes each layer to identify the optimal per-layer compression ratio, while simultaneously achieving the desired overall compression. Our algorithm hinges on the idea of compressing each convolutional (or fully-connected) layer by "slicing" its channels into multiple groups and decomposing each group via low-rank decomposition. At the core of our algorithm is the derivation of layer-wise error bounds from the Eckart-Young-Mirsky theorem. We then leverage these bounds to frame the compression problem as an optimization problem where we wish to minimize the maximum compression error across layers and propose an efficient algorithm towards a solution. Our experiments indicate that our method outperforms existing low-rank compression approaches across a wide range of networks and data sets. We believe that our results open up new avenues for future research into the global performance-size trade-offs of modern neural networks.
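
The abstract above refers to layer-wise error bounds derived from the Eckart-Young-Mirsky theorem. As a rough illustration (not the authors' implementation), the sketch below checks the theorem numerically with NumPy: the rank-j truncated SVD of a matrix has spectral-norm error equal to the (j+1)-th singular value and Frobenius-norm error equal to the root of the sum of the squared trailing singular values. The matrix `W`, its shape, the rank `j`, the number of groups, and the helper name `truncated_svd` are all assumptions made for the example, and the column split is only a simplified stand-in for the channel "slicing" described in the abstract.

```python
import numpy as np

def truncated_svd(W, j):
    """Return factors (left, right) with left @ right equal to the rank-j
    truncated SVD of W, plus the full vector of singular values."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :j] * s[:j], Vt[:j], s

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))  # hypothetical weight matrix of one layer
j = 8                              # hypothetical target rank

left, right, s = truncated_svd(W, j)
W_j = left @ right

# Errors actually incurred by the rank-j approximation.
err_spectral = np.linalg.norm(W - W_j, 2)
err_frobenius = np.linalg.norm(W - W_j, "fro")

# Values predicted by the Eckart-Young-Mirsky theorem.
bound_spectral = s[j]                           # (j+1)-th singular value
bound_frobenius = np.sqrt(np.sum(s[j:] ** 2))   # tail of the spectrum

# Simplified stand-in for "slicing" into groups: split the columns into
# 4 groups and decompose each group separately with a smaller rank.
groups = np.array_split(W, 4, axis=1)
group_factors = [truncated_svd(G, 2)[:2] for G in groups]
W_grouped = np.hstack([L @ R for L, R in group_factors])
```

Because the errors are known in closed form from the singular values, a per-layer rank can in principle be chosen from the spectrum alone, without reconstructing the layer.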

“…Knowledge Distillation (KD) Hinton et al (2015) is leveraged in some works (Jiao et al (2019); Sanh et al (2019)) to bridge the gap between a compact model and BERT. Further, many works Mao et al (2020); Tukan et al (2020) use Low rank matrix factorization to deal with the issue. The authors of ROSITA (Liu et al (2021)) outline a methodology to combine weight pruning, KD and low rank factorization.…”

Section: Compression (mentioning)

confidence: 99%

BERMo: What can BERT learn from ELMo?

Kodge, Roy 2021

Preprint

We propose BERMo, an architectural modification to BERT, which makes predictions based on a hierarchy of surface, syntactic and semantic language features. We use linear combination scheme proposed in Embeddings from Language Models (ELMo) to combine the scaled internal representations from different network depths. Our approach has two-fold benefits: (1) improved gradient flow for the downstream task as every layer has a direct connection to the gradients of the loss function and (2) increased representative power as the model no longer needs to copy the features learned in the shallower layer which are necessary for the downstream task. Further, our model has a negligible parameter overhead as there is a single scalar parameter associated with each layer in the network. Experiments on the probing task from SentEval dataset show that our model performs up to 4.65% better in accuracy than the baseline with an average improvement of 2.67% on the semantic tasks. When subject to compression techniques, we find that our model enables stable pruning for compressing small datasets like SST-2, where the BERT model commonly diverges. We observe that our approach converges 1.67× and 1.15× faster than the baseline on MNLI and QQP tasks from GLUE dataset. Moreover, our results show that our approach can obtain better parameter efficiency for penalty based pruning approaches on QQP task.
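
The linear combination scheme from ELMo mentioned above can be sketched as a softmax-weighted sum of per-layer representations, scaled by one global factor. The snippet below is a minimal NumPy illustration under assumptions: the function name `scalar_mix`, the tensor shapes, and the use of a single `gamma` scale are chosen for the example and are not taken from the BERMo code.

```python
import numpy as np

def scalar_mix(layer_outputs, scalars, gamma=1.0):
    """Combine per-layer representations with one learned scalar per layer.

    layer_outputs: array of shape (num_layers, seq_len, hidden_dim)
    scalars:       array of shape (num_layers,), unnormalized layer weights
    gamma:         global scaling factor
    """
    # Numerically stable softmax over the layer scalars.
    w = np.exp(scalars - np.max(scalars))
    w = w / w.sum()
    # Weighted sum over the layer axis gives shape (seq_len, hidden_dim).
    return gamma * np.tensordot(w, layer_outputs, axes=1)

rng = np.random.default_rng(0)
layers = rng.standard_normal((12, 5, 16))  # hypothetical 12-layer encoder output
mixed = scalar_mix(layers, np.zeros(12))   # equal scalars give a plain average
```

With one scalar per layer, the parameter overhead is the number of layers plus the global scale, which matches the "negligible parameter overhead" described in the abstract.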

“…The only difference is that the SVD computation of the optimal subspace for a cluster of points (k = 1) should be replaced by more involved approximation algorithm for computing the subspace that minimizes sum over distances to the power of q; see e.g. Tukan et al (2020b); Clarkson and Woodruff (2015).…”

Section: Generalizations and Extensions (mentioning)

confidence: 99%
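
To make the k = 1 problem in the statement above concrete, here is a minimal sketch of one common heuristic for it: iteratively reweighted least squares (IRLS), which repeatedly solves a weighted SVD problem with weights dist^(q-2). This is a swapped-in heuristic with no approximation guarantee; it is not the algorithm of Tukan et al. or of Clarkson and Woodruff. The function names, the toy data, and the `eps` clamp are assumptions made for the example.

```python
import numpy as np

def subspace_distances(A, V):
    """Euclidean distance from each row of A to the span of the
    orthonormal columns of V (a subspace through the origin)."""
    residual = A - (A @ V) @ V.T
    return np.linalg.norm(residual, axis=1)

def irls_subspace(A, j, q=1.0, iters=50, eps=1e-8):
    """Heuristic j-dimensional subspace for the sum of distances to the power q
    (intended for 0 < q <= 2). Starts from the SVD solution (the q = 2 case)."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    V = Vt[:j].T
    for _ in range(iters):
        # Clamp distances so the weights stay finite for points on the subspace.
        d = np.maximum(subspace_distances(A, V), eps)
        w = d ** (q - 2.0)
        # Weighted least-squares step: SVD of the row-rescaled matrix.
        _, _, Vt = np.linalg.svd(np.sqrt(w)[:, None] * A, full_matrices=False)
        V = Vt[:j].T
    return V

# Toy data: points near the first coordinate axis in R^3, plus a few outliers.
rng = np.random.default_rng(0)
inliers = np.outer(rng.standard_normal(50), [1.0, 0.0, 0.0])
inliers += 0.01 * rng.standard_normal((50, 3))
outliers = 10.0 * rng.standard_normal((5, 3))
A = np.vstack([inliers, outliers])

V_svd = np.linalg.svd(A, full_matrices=False)[2][:1].T
V_robust = irls_subspace(A, j=1, q=1.0)

cost_svd = subspace_distances(A, V_svd).sum()       # sum of non-squared distances
cost_robust = subspace_distances(A, V_robust).sum()
```

For q at most 2, each reweighting step minimizes a quadratic upper bound on the (clamped) objective, so the cost does not increase from one iteration to the next, but the result is only a local solution.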

Deep Learning Meets Projective Clustering

Maalouf, Lang, Rus et al. 2020

Preprint

Self Cite

A common approach for compressing NLP networks is to encode the embedding layer as a matrix A ∈ R^{n×d}, compute its rank-j approximation A_j via SVD, and then factor A_j into a pair of matrices that correspond to smaller fully-connected layers to replace the original embedding layer. Geometrically, the rows of A represent points in R^d, and the rows of A_j represent their projections onto the j-dimensional subspace that minimizes the sum of squared distances ("errors") to the points. In practice, these rows of A may be spread around k > 1 subspaces, so factoring A based on a single subspace may lead to large errors that turn into large drops in accuracy. Inspired by projective clustering from computational geometry, we suggest replacing this subspace by a set of k subspaces, each of dimension j, that minimizes the sum of squared distances over every point (row in A) to its closest subspace. Based on this approach, we provide a novel architecture that replaces the original embedding layer by a set of k small layers that operate in parallel and are then recombined with a single fully-connected layer. Extensive experimental results on the GLUE benchmark yield networks that are both more accurate and smaller compared to the standard matrix factorization (SVD). For example, we further compress DistilBERT by reducing the size of the embedding layer by 40% while incurring only a 0.5% average drop in accuracy over all nine GLUE tasks, compared to a 2.8% drop using the existing SVD approach. On RoBERTa we achieve 43% compression of the embedding layer with less than a 0.8% average drop in accuracy as compared to a 3% drop previously. Open code for reproducing and extending our results is provided.
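
The geometric idea in this abstract can be sketched in a few lines of NumPy. The code below is a simplified illustration, not the authors' released code: it factors a matrix through a single j-dimensional subspace via SVD, then runs a Lloyd-style (EM-like) alternation that fits k subspaces through the origin and assigns each row to its closest one. The function names, the toy data, and the fallback used for under-populated clusters are assumptions made for the example.

```python
import numpy as np

def best_subspace(P, j):
    """Orthonormal basis (d x j) of the j-dimensional subspace through the
    origin minimizing the sum of squared distances to the rows of P."""
    _, _, Vt = np.linalg.svd(P, full_matrices=False)
    return Vt[:j].T

def squared_distances(A, V):
    """Squared distance from each row of A to the span of the columns of V."""
    R = A - (A @ V) @ V.T
    return np.sum(R * R, axis=1)

def projective_clustering(A, k, j, iters=20, seed=0):
    """Alternate between fitting one subspace per cluster (via SVD) and
    reassigning every row to its closest subspace. Finds a local solution only."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=A.shape[0])
    for _ in range(iters):
        subspaces = []
        for c in range(k):
            P = A[labels == c]
            if P.shape[0] < j:   # under-populated cluster: fall back to all rows
                P = A
            subspaces.append(best_subspace(P, j))
        D = np.stack([squared_distances(A, V) for V in subspaces], axis=1)
        labels = np.argmin(D, axis=1)
    cost = D[np.arange(A.shape[0]), labels].sum()
    return labels, subspaces, cost

# Toy data: rows spread around two different lines in R^3 (k = 2, j = 1).
rng = np.random.default_rng(1)
line_a = np.outer(rng.standard_normal(100), [1.0, 0.0, 0.0])
line_b = np.outer(rng.standard_normal(100), [0.0, 1.0, 0.0])
A = np.vstack([line_a, line_b]) + 0.01 * rng.standard_normal((200, 3))

# Single-subspace baseline: A_j = (A V) V^T, stored as two smaller factors.
V = best_subspace(A, 1)
first_layer, second_layer = A @ V, V.T          # shapes (n, j) and (j, d)
single_cost = squared_distances(A, V).sum()

labels, subspaces, cost = projective_clustering(A, k=2, j=1)
```

In this sketch the k-subspace cost is never larger than the single-subspace cost, because each cluster's own best-fit subspace is at least as good for its rows as the global one, and reassigning rows to their closest subspace can only lower the cost further.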
