BinaryRelax: A Relaxation Approach For Training Deep Neural Networks With Quantized Weights

Yin, Penghang; Zhang, Shuai; Lyu, Jiancheng; Osher, Stanley; Qi, Yingyong; Xin, Jack

doi:10.48550/arxiv.1801.06313

Cited by 14 publications

(27 citation statements)

References 24 publications

“…The DNN representation of linear finite element functions opens a door for theoretical explanation and possible improvement on the application of quantized weights in convolutional neural networks (see [17]).…”

Section: Discussion (mentioning)

confidence: 99%

“…In this section, we will show the rationality of low bit-width models with respect to approximation properties in some sense by investigating that a special type of ReLU DNN model can also recover all CPWL functions. In [17], an incremental network quantization strategy is proposed for transforming a general trained CNN into some low bit-width version in which the parameters are all zeros or powers of two. Mathematically speaking, a low bit-width DNN model is defined as:…”

Section: Low Bit-width DNN Models (mentioning)

confidence: 99%
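
The statement above describes low bit-width models whose parameters are all zeros or powers of two. As a rough illustration only, the following sketch rounds each weight to the nearest value in {0, ±2^k}. The function name `quantize_pow2` and the exponent range `k_min..k_max` are hypothetical choices made for this example; the quoted definition is truncated in the source and is not reproduced here.

```python
import numpy as np

def quantize_pow2(w, k_min=-3, k_max=0):
    """Round each weight to the nearest value in {0, +/-2^k : k_min <= k <= k_max}.

    The exponent range is an assumption made for this sketch.
    """
    w = np.asarray(w, dtype=float)
    mags = 2.0 ** np.arange(k_min, k_max + 1)             # allowed magnitudes
    levels = np.concatenate((-mags[::-1], [0.0], mags))   # sorted candidate values
    # Distance from every weight to every candidate, then pick the closest one.
    idx = np.argmin(np.abs(w[..., None] - levels), axis=-1)
    return levels[idx]

weights = np.array([0.3, -0.9, 0.05, 0.0])
quantized = quantize_pow2(weights)
```

With weights restricted to this set, a multiplication by a nonzero weight reduces to a bit shift and a sign change, which is one reason such models are attractive for compression.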

“…In real applications, many efforts have been made to compress the deep neural networks by using heavily quantized weights, cf. [17]. Especially, binary and ternary weight models not only give high model compression rate, but also eliminate the need of most floating-point multiplications during inference phase.…”

Section: Introduction (mentioning)

confidence: 99%

ReLU Deep Neural Networks and Linear Finite Elements

He, Li, et al. 2018

Preprint

In this paper, we investigate the relationship between deep neural networks (DNN) with rectified linear unit (ReLU) function as the activation function and continuous piecewise linear (CPWL) functions, especially CPWL functions from the simplicial linear finite element method (FEM). We first consider the special case of FEM. By exploring the DNN representation of its nodal basis functions, we present a ReLU DNN representation of CPWL in FEM. We theoretically establish that at least 2 hidden layers are needed in a ReLU DNN to represent any linear finite element function in Ω ⊆ R^d when d ≥ 2. Consequently, for d = 2, 3, which are often encountered in scientific and engineering computing, two hidden layers are necessary and sufficient for any CPWL function to be represented by a ReLU DNN. Then we include a detailed account of how a general CPWL function in R^d can be represented by a ReLU DNN with at most ⌈log2(d + 1)⌉ hidden layers, and we also give an estimate of the number of neurons in the DNN that are needed in such a representation. Furthermore, using the relationship between DNN
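
To make the idea of a ReLU representation of a nodal basis function concrete, here is a minimal one-dimensional sketch. It writes the standard hat function on a uniform grid as a combination of three ReLU units, which is a single hidden layer. This is only the 1-D case, chosen for illustration; the abstract's two-hidden-layer lower bound concerns d ≥ 2. The names `relu` and `hat` are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def hat(x, left, center, right):
    """1-D linear finite element basis function as a one-hidden-layer ReLU net.

    Assumes uniform spacing, i.e. center - left == right - center.
    """
    h = center - left
    # Three hidden ReLU units combined with output weights 1/h, -2/h, 1/h.
    return (relu(x - left) - 2.0 * relu(x - center) + relu(x - right)) / h

xs = np.array([-1.0, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0])
values = hat(xs, left=0.0, center=1.0, right=2.0)
```

The function equals 1 at the center node, decays linearly to 0 at the two neighbouring nodes, and vanishes outside them, because the three ReLU slopes cancel to the right of `right`.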

“…To this end, we follow (Cai et al, 2017) and resort to a modified batch normalization layer (Ioffe & Szegedy, 2015) without the scale and shift, whose output components approximately follow a unit Gaussian distribution. Then the α that fits the input of activation layer the best can be pre-computed by a variant of Lloyd's algorithm (Lloyd, 1982; Yin et al, 2018a) applied to a set of simulated 1-D half-Gaussian data. After determining the α, it will be fixed during the whole training process.…”

Section: Methods (mentioning)

confidence: 99%
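
The pre-computation of α described in the statement above can be sketched as a Lloyd-type alternating minimization on simulated half-Gaussian samples. This is a minimal illustration under stated assumptions, not the cited authors' exact procedure: the quantizer here uses nearest-level rounding onto {0, α, ..., (2^b − 1)α}, and the names `quantize` and `fit_alpha`, the bit-width, the starting value, and the sample size are all hypothetical.

```python
import numpy as np

def quantize(y, alpha, levels):
    """Map y to the nearest value in {0, alpha, ..., levels * alpha} (assumed quantizer)."""
    return alpha * np.clip(np.round(y / alpha), 0, levels)

def fit_alpha(y, bits=2, alpha=1.0, iters=50):
    """Lloyd-style alternation minimizing mean((y - quantize(y, alpha))**2) over alpha."""
    levels = 2 ** bits - 1
    for _ in range(iters):
        # Assignment step: integer level index of each sample for the current alpha.
        k = np.clip(np.round(y / alpha), 0, levels)
        denom = np.sum(k * k)
        if denom == 0:            # every sample mapped to level 0; nothing to update
            break
        # Update step: least-squares optimal alpha for the fixed assignments.
        new_alpha = np.sum(k * y) / denom
        converged = abs(new_alpha - alpha) < 1e-10
        alpha = new_alpha
        if converged:
            break
    return alpha

rng = np.random.default_rng(0)
samples = np.abs(rng.standard_normal(100_000))   # simulated 1-D half-Gaussian data
alpha_star = fit_alpha(samples, bits=2)
```

Each of the two steps can only lower the squared quantization error, so the fitted value is never worse than the starting value. Once computed, `alpha_star` would be held fixed during training, as the statement describes.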

“…It calls for minimizing a piecewise constant and highly nonconvex empirical risk function f (w) subject to a discrete set-constraint w ∈ Q that characterizes the quantized weights. In particular, weight quantization of DNN have been extensively studied in the literature; see for examples (Li et al, 2016; Zhu et al, 2016; Yin et al, 2016; 2018a; Hou & Kwok, 2018; He et al, 2018; Li & Hao, 2018). On the other hand, the gradient ∇f (w) in training activation quantized DNN is almost everywhere (a.e.)…”

Section: Introduction (mentioning)

confidence: 99%

Understanding Straight-Through Estimator in Training Activation Quantized Neural Nets

Yin, Lyu, Zhang, et al. 2019

Preprint

Self Cite

Training activation quantized neural networks involves minimizing a piecewise constant function whose gradient vanishes almost everywhere, which is undesirable for the standard back-propagation or chain rule. An empirical way around this issue is to use a straight-through estimator (STE) (Bengio et al., 2013) in the backward pass only, so that the "gradient" through the modified chain rule becomes non-trivial. Since this unusual "gradient" is certainly not the gradient of the loss function, the following question arises: why does searching in its negative direction minimize the training loss? In this paper, we provide the theoretical justification of the concept of STE by answering this question. We consider the problem of learning a two-linear-layer network with binarized ReLU activation and Gaussian input data. We shall refer to the unusual "gradient" given by the STE-modified chain rule as the coarse gradient. The choice of STE is not unique. We prove that if the STE is properly chosen, the expected coarse gradient correlates positively with the population gradient (not available for the training), and its negation is a descent direction for minimizing the population loss. We further show that the associated coarse gradient descent algorithm converges to a critical point of the population loss minimization problem. Moreover, we show that a poor choice of STE leads to instability of the training algorithm near certain local minima, which is verified with CIFAR-10 experiments.
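
The following sketch illustrates the mechanics of a coarse gradient for a small two-linear-layer model with a binary activation. The forward pass uses the step function, whose true derivative is zero almost everywhere; the backward pass substitutes the derivative of the ordinary ReLU as the STE. The squared loss on a single sample, the choice of the ReLU derivative as the STE, and the names `binary_act`, `relu_grad`, and `coarse_grad_w` are assumptions made for this illustration, not the paper's exact setup.

```python
import numpy as np

def binary_act(x):
    """Forward activation: 1 where x > 0, else 0 (piecewise constant)."""
    return (x > 0).astype(float)

def relu_grad(x):
    """Derivative of ReLU, used in the backward pass in place of the
    zero-almost-everywhere derivative of binary_act (assumed STE choice)."""
    return (x > 0).astype(float)

def coarse_grad_w(Z, w, v, y):
    """Coarse gradient w.r.t. w of 0.5 * (v @ binary_act(Z @ w) - y) ** 2.

    Z: (m, n) input, w: (n,) first-layer weights, v: (m,) second-layer weights.
    """
    pre = Z @ w                                  # pre-activations, shape (m,)
    residual = v @ binary_act(pre) - y           # scalar prediction error
    # Modified chain rule: relu_grad stands in for the activation's derivative.
    return residual * (Z.T @ (v * relu_grad(pre)))

Z = np.array([[1.0, 0.0], [0.0, 1.0]])
w = np.array([1.0, -1.0])
v = np.array([2.0, 3.0])
g = coarse_grad_w(Z, w, v, y=0.0)
```

In this toy case the true gradient with respect to `w` is zero, since small changes in `w` do not flip any activation, while the coarse gradient is non-zero and can therefore drive an update.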

Blended coarse gradient descent for full quantization of deep neural networks

et al. 2019

Self Cite

Quantized deep neural networks (QDNNs) are attractive due to their much lower memory storage and faster inference speed than their regular full precision counterparts. To maintain the same performance level especially at low bit-widths, QDNNs must be retrained. Their training involves piecewise constant activation functions and discrete weights, hence mathematical challenges arise. We introduce the notion of coarse gradient and propose the blended coarse gradient descent (BCGD) algorithm, for training fully quantized neural networks. Coarse gradient is generally not a gradient of any function but an artificial ascent direction. The weight update of BCGD goes by coarse gradient correction of a weighted average of the full precision weights and their quantization (the so-called blending), which yields sufficient descent in the objective value and thus accelerates the training. Our experiments demonstrate that this simple blending technique is very effective for quantization at extremely low bit-width such as binarization. In full quantization of ResNet-18 for ImageNet classification task, BCGD gives 64.36% top-1 accuracy with binary weights across all layers and 4-bit adaptive activation. If the weights in the first and last layers are kept in full precision, this number increases to 65.46%. As theoretical justification, we show convergence analysis of coarse gradient descent for a two-linear-layer neural network model with Gaussian input data, and prove that the expected coarse gradient correlates positively with the underlying true gradient.

Keywords: weight/activation quantization · blended coarse gradient descent · sufficient descent property · deep neural networks

Mathematics Subject Classification (2010): 90C35, 90C26, 90C52, 90C90.
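
The blending step described in the abstract can be sketched as follows: the new full-precision weights are a weighted average of the current weights and their quantization, corrected by a coarse gradient. Several details are assumptions made for this illustration: the scaled-sign binarizer, evaluating the coarse gradient at the quantized weights, the default step size and blending parameter, and the names `binarize`, `bcgd_step`, and `grad_fn`.

```python
import numpy as np

def binarize(w):
    """Assumed binarizer: sign of each weight scaled by the mean absolute value."""
    scale = np.mean(np.abs(w))
    return scale * np.where(w >= 0, 1.0, -1.0)

def bcgd_step(w, grad_fn, lr=0.1, rho=1e-2):
    """One blended update of the full-precision weights w.

    grad_fn is a user-supplied callable returning a coarse gradient; here it is
    evaluated at the quantized weights (an assumption of this sketch).
    """
    q = binarize(w)
    blended = (1.0 - rho) * w + rho * q      # weighted average ("blending")
    return blended - lr * grad_fn(q)         # coarse gradient correction

w0 = np.array([0.5, -1.5])
# Toy coarse gradient for demonstration only: the identity map.
w1 = bcgd_step(w0, grad_fn=lambda q: q, lr=0.1, rho=0.5)
```

Setting `rho=0` recovers a plain coarse gradient step on the full-precision weights, so the blending parameter controls how strongly each update pulls the weights toward their quantized values.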
