We develop a fast, tractable technique called Net-Trim for simplifying a trained neural network. The method is a convex post-processing module which prunes (sparsifies) a trained network layer by layer while preserving its internal responses. We present a comprehensive analysis of Net-Trim from both the algorithmic and sample-complexity standpoints, centered on a fast, scalable convex optimization program. Our analysis includes consistency results between the initial and retrained models (before and after Net-Trim is applied), and guarantees on the number of training samples needed to discover a network that can be expressed using a certain number of nonzero terms. Specifically, if there is a set of weights using at most s terms that can re-create the layer outputs from the layer inputs, we can find these weights from O(s log(N/s)) samples, where N is the input size. These theoretical results are similar to those for sparse regression using the Lasso, and our analysis uses some of the same tools (namely, recent results on concentration of measure and convex analysis). Finally, we propose an algorithmic framework based on the alternating direction method of multipliers (ADMM), which allows a fast and simple implementation of Net-Trim for network pruning and compression.
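To make the layer-by-layer idea concrete, the following is a minimal sketch of a per-layer pruning program in the spirit of Net-Trim, written with the generic solver cvxpy rather than the paper's ADMM implementation. The variable names (X, Y, eps), the assumption of ReLU activations, and the particular way the constraints are split into "active" and "inactive" parts are illustrative assumptions for this sketch, not a definitive account of the paper's formulation.

```python
# Illustrative per-layer pruning sketch in the spirit of Net-Trim
# (a generic cvxpy model; the paper develops a dedicated ADMM solver).
import cvxpy as cp

def prune_layer(X, Y, eps):
    """Search for a sparse weight matrix W with relu(W.T @ X) close to Y.

    X   : (N, P) layer inputs, one column per training sample
    Y   : (M, P) recorded ReLU responses of the trained layer
    eps : allowed slack (eps >= 0) in reproducing the responses
    """
    N, _ = X.shape
    M = Y.shape[0]
    W = cp.Variable((N, M))
    Z = W.T @ X                     # candidate pre-activations, shape (M, P)

    active = (Y > 0).astype(float)  # entries where the trained layer's ReLU fired
    # Where the recorded response is positive, reproduce it up to eps (Frobenius norm);
    # where it is zero, only require the new pre-activation to stay at or below eps.
    constraints = [
        cp.norm(cp.multiply(active, Z - Y), "fro") <= eps,
        cp.multiply(1.0 - active, Z) <= eps,
    ]
    # Minimize the sum of absolute weights to promote a sparse (pruned) layer.
    problem = cp.Problem(cp.Minimize(cp.sum(cp.abs(W))), constraints)
    problem.solve()
    return W.value
```

In a deep network, a program of this kind would be run once per layer, either in parallel (each layer fit against the original trained responses) or in cascade (each layer fit against the pruned previous layer's outputs); both variants and their consistency bounds are discussed in Section 3, and the ADMM framework mentioned above provides a faster, purpose-built solver than the generic one used in this sketch.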
P ≳ s log(N/s). We also show that if the x_p are subgaussian, then so are the y_p. As a result, the theory can be applied layer by layer, yielding a sampling result for networks of arbitrary depth. (When we apply the algorithm in practice, the equality constraints in (1) are relaxed; this is discussed in detail in Section 3.1.)

Along with these theoretical guarantees, Net-Trim offers state-of-the-art performance on realistic networks. In Section 6, we present numerical experiments showing that compression factors between 10x and 50x (removing 90% to 98% of the connections) are possible with very little loss in test accuracy.

Contributions and relations to previous work
This paper provides a full description of the Net-Trim method from both a theoretical and an algorithmic perspective. In Section 3, we present our convex formulation for sparsifying the weights in the linear layers of a network; we describe how the procedure can be applied layer by layer in a deep network, either in parallel or serially (cascading the results), and present consistency bounds for both approaches. Section 4 presents our main theoretical result, stated precisely in Theorem 4. This result derives an upper bound on the number of data samples needed to reliably discover a layer whose linear map has at most s connections: we show that if the data samples are random, then these weights can be learned from O(s log(N/s)) samples. Mathematically, this result is comparable to the sample complexity bounds for the Lasso in performing sparse regression on a linear model (also known as the compressed sensing problem). Our analysis is based on the bowling scheme [30, 24]; the main technical challenges are adapting this technique to the piecewise linear...