We propose BinaryRelax, a simple two-phase algorithm for training deep neural networks with quantized weights. The set constraint that characterizes the quantization of weights is not imposed until the late stage of training; instead, a sequence of pseudo-quantized weights is maintained. Specifically, we relax the hard constraint into a continuous regularizer via the Moreau envelope, which turns out to be the squared Euclidean distance to the set of quantized weights. The pseudo-quantized weights are obtained by linearly interpolating between the float weights and their quantizations. A continuation strategy is adopted to push the weights toward the quantized state by gradually increasing the regularization parameter. In the second phase, an exact quantization scheme with a small learning rate is invoked to guarantee fully quantized weights. We test BinaryRelax on the benchmark CIFAR and ImageNet color image datasets to demonstrate the superiority of the relaxed quantization approach and the improved accuracy over state-of-the-art training methods. Finally, we prove the convergence of BinaryRelax under an approximate orthogonality condition.
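A minimal NumPy sketch of the two-phase idea described above, not the authors' implementation: the scaled-sign projection `binary_quantize`, the placeholder gradient, and all step sizes and schedules are illustrative assumptions, chosen only to show the relaxed (Moreau-envelope) update and the continuation on the regularization parameter.

```python
import numpy as np

def binary_quantize(w):
    # Illustrative projection onto scaled binary weights {-a, +a};
    # a = mean(|w|) is one common choice (an assumption here).
    a = np.mean(np.abs(w))
    return a * np.sign(w)

def pseudo_quantize(w, lam):
    # Moreau-envelope relaxation: linear interpolation between the float
    # weights and their quantization, u = (w + lam * proj(w)) / (1 + lam).
    return (w + lam * binary_quantize(w)) / (1.0 + lam)

rng = np.random.default_rng(0)
w = rng.standard_normal(16)          # float (auxiliary) weights
lam, growth, lr = 1.0, 1.02, 0.05

# Phase 1: relaxed quantization with continuation on lam.
for t in range(300):
    u = pseudo_quantize(w, lam)      # weights used in the forward pass
    grad = u                         # placeholder for the backprop gradient at u
    w = w - lr * grad                # gradient step kept on the float weights
    lam *= growth                    # continuation pushes u toward the quantized set

# Phase 2: exact quantization with a small learning rate.
lr = 0.005
for t in range(50):
    u = binary_quantize(w)
    grad = u
    w = w - lr * grad
```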
Quantized deep neural networks (QDNNs) are attractive due to their much lower memory storage and faster inference speed than their regular full-precision counterparts. To maintain the same performance level, especially at low bit-widths, QDNNs must be retrained. Their training involves piecewise constant activation functions and discrete weights; hence, mathematical challenges arise. We introduce the notion of coarse gradient and propose the blended coarse gradient descent (BCGD) algorithm for training fully quantized neural networks. A coarse gradient is generally not the gradient of any function but an artificial ascent direction. The BCGD weight update applies a coarse gradient correction to a weighted average of the full-precision weights and their quantization (the so-called blending), which yields sufficient descent in the objective value and thus accelerates the training. Our experiments demonstrate that this simple blending technique is very effective for quantization at extremely low bit-widths such as binarization. In full quantization of ResNet-18 for the ImageNet classification task, BCGD gives 64.36% top-1 accuracy with binary weights across all layers and 4-bit adaptive activation. If the weights in the first and last layers are kept in full precision, this number increases to 65.46%. As theoretical justification, we show a convergence analysis of coarse gradient descent for a two-linear-layer neural network model with Gaussian input data, and prove that the expected coarse gradient correlates positively with the underlying true gradient.

Keywords: weight/activation quantization · blended coarse gradient descent · sufficient descent property · deep neural networks

Mathematics Subject Classification (2010): 90C35, 90C26, 90C52, 90C90
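A minimal sketch of the blended update described above, again not the authors' code: the binary `quantize` function, the blending parameter `rho`, and the placeholder coarse gradient are assumptions used only to illustrate how the float weights are blended with their quantization before the coarse-gradient correction.

```python
import numpy as np

def quantize(w):
    # Illustrative binary quantization (scaled sign); the paper treats more
    # general low-bit schemes.
    return np.mean(np.abs(w)) * np.sign(w)

def bcgd_step(w_float, coarse_grad, lr, rho):
    # Blended coarse gradient descent update: a convex combination ("blending")
    # of the float weights and their quantization, corrected by the coarse gradient.
    return (1.0 - rho) * w_float + rho * quantize(w_float) - lr * coarse_grad

rng = np.random.default_rng(0)
w = rng.standard_normal(16)
for t in range(200):
    w_q = quantize(w)              # quantized weights used in the forward pass
    coarse_grad = w_q              # placeholder: backprop with a straight-through
                                   # proxy for the quantized activations
    w = bcgd_step(w, coarse_grad, lr=0.05, rho=1e-3)

w_deploy = quantize(w)             # fully quantized weights for inference
```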
We study the residual diffusion phenomenon in chaotic advection computationally via an adaptive orthogonal basis. The chaotic advection is generated by a class of time-periodic cellular flows arising in modeling the transition to turbulence in Rayleigh-Bénard experiments. Residual diffusion refers to the non-zero effective (homogenized) diffusion in the limit of zero molecular diffusion, a result of chaotic mixing of the streamlines. In this limit, the solutions of the advection-diffusion equation develop sharp gradients and demand a large number of Fourier modes to resolve, rendering computation expensive. We construct an adaptive orthogonal basis (training) with built-in sharp-gradient structures from fully resolved spectral solutions at a few sampled molecular diffusivities. This is done by taking snapshots of solutions in time and performing a singular value decomposition of the matrix whose columns are these snapshots. The singular values decay rapidly, allowing us to extract a small percentage of left singular vectors corresponding to the top singular values as adaptive basis vectors. The trained orthogonal adaptive basis makes possible low-cost computation of the effective diffusivities at smaller molecular diffusivities (testing). The testing errors decrease as the training occurs at smaller molecular diffusivities. We make use of the Poincaré map of the advection-diffusion equation to bypass long-time simulation and gain accuracy in computing the effective diffusivity and learning the adaptive basis. We observe a non-monotone relationship between residual diffusivity and the amount of chaos in the advection, though the overall trend is that sufficient chaos leads to higher residual diffusivity.
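A minimal sketch of the snapshot-SVD construction described above, under stated assumptions: the snapshot matrix here is random placeholder data (real snapshots would come from the resolved spectral solver), and the 0.9999 energy threshold is an arbitrary illustrative choice.

```python
import numpy as np

# Columns of S are solution snapshots of the advection-diffusion equation,
# fully resolved at a few sampled ("training") molecular diffusivities.
rng = np.random.default_rng(0)
n_modes, n_snapshots = 4096, 200
S = rng.standard_normal((n_modes, n_snapshots))   # placeholder snapshot matrix

U, sigma, _ = np.linalg.svd(S, full_matrices=False)

# Rapid decay of the singular values lets us keep only the leading left
# singular vectors as the adaptive orthogonal basis.
energy = np.cumsum(sigma**2) / np.sum(sigma**2)
r = int(np.searchsorted(energy, 0.9999)) + 1
basis = U[:, :r]

# At a new ("testing") molecular diffusivity the solution is sought in
# span(basis), e.g. via a Galerkin projection of the Poincare map; here we
# only show projecting a snapshot onto the reduced basis.
u_new = S[:, 0]
coeffs = basis.T @ u_new
u_reduced = basis @ coeffs
```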