A common technique for compressing a neural network is to compute the rank-$k$ $\ell_2$-approximation $A_{k,2}$ of the matrix $A \in \mathbb{R}^{n\times d}$ that corresponds to a fully connected layer (or an embedding layer). Here, $d$ is the number of neurons in the layer, $n$ is the number of neurons in the next layer, and $A_{k,2}$ can be stored in $O((n+d)k)$ memory instead of $O(nd)$. This $\ell_2$-approximation minimizes the sum of every entry of the matrix $A - A_{k,2}$ raised to the power of $p=2$, over every matrix $A_{k,2} \in \mathbb{R}^{n\times d}$ whose rank is $k$. While it can be computed efficiently via SVD, the $\ell_2$-approximation is known to be very sensitive to outliers ("far-away" rows). Hence, machine learning uses, e.g., Lasso regression, $\ell_1$-regularization, and $\ell_1$-SVM, which are based on the $\ell_1$-norm. This paper suggests replacing the rank-$k$ $\ell_2$-approximation with the $\ell_p$-approximation, for $p \in [1,2]$. We then provide practical and provable approximation algorithms to compute it for any $p \geq 1$, based on modern techniques in computational geometry. Extensive experimental results on the GLUE benchmark for compressing BERT, DistilBERT, XLNet, and RoBERTa confirm this theoretical advantage. For example, our approach achieves 28% compression of RoBERTa's embedding layer with only a 0.63% additive drop in accuracy (without fine-tuning), on average over all GLUE tasks, compared to an 11% drop using the existing $\ell_2$-approximation. Open code is provided for reproducing and extending our results.
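To make the baseline concrete, the following is a minimal sketch of the standard rank-$k$ $\ell_2$-approximation via truncated SVD that the paper proposes to replace with an $\ell_p$-approximation; the function name, matrix sizes, and rank are illustrative assumptions, not taken from the paper or its released code.

```python
import numpy as np

def rank_k_l2_approximation(A: np.ndarray, k: int):
    """Truncated SVD: returns factors U_k (n x k) and V_k (k x d) whose product
    is the best rank-k approximation of A under the l2 (Frobenius) norm.
    Storing the two factors takes O((n + d) k) memory instead of O(n d)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k = U[:, :k] * s[:k]   # absorb the singular values into the left factor
    V_k = Vt[:k, :]
    return U_k, V_k

# Illustrative usage: compress a 768 x 1024 layer matrix to rank 64.
rng = np.random.default_rng(0)
A = rng.standard_normal((768, 1024))
U_k, V_k = rank_k_l2_approximation(A, k=64)
A_k2 = U_k @ V_k                    # rank-64 l2-approximation of A
print(np.linalg.norm(A - A_k2))     # Frobenius-norm approximation error
```

In a network, the factorization $A \approx U_k V_k$ means the original fully connected (or embedding) layer can be replaced by two consecutive smaller layers with $k$ hidden units, which is what yields the $O((n+d)k)$ storage. The $\ell_p$-approximation studied in the paper minimizes the entry-wise error raised to the power $p$ instead of $2$ and requires the paper's dedicated algorithms rather than a plain SVD.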