2018
DOI: 10.48550/arxiv.1812.11446
Preprint

Greedy Layerwise Learning Can Scale to ImageNet

Abstract: Shallow supervised 1-hidden layer neural networks have a number of favorable properties that make them easier to interpret, analyze, and optimize than their deep counterparts, but lack their representational power. Here we use 1-hidden layer learning problems to sequentially build deep networks layer by layer, which can inherit properties from shallow networks. Contrary to previous approaches using shallow networks, we focus on problems where deep learning is reported as critical for success. We thus study CNN…
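
The layer-by-layer scheme the abstract describes reduces to a simple loop: solve a 1-hidden-layer problem (one block plus an auxiliary linear classifier), freeze it, and repeat on top of the frozen features. Below is a minimal PyTorch sketch of that loop; the block and head shapes (`make_block`, `AuxHead`) are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of greedy layer-wise learning, assuming PyTorch.
# Block and head shapes are illustrative, not the paper's architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_block(in_ch, out_ch):
    # One "1-hidden-layer" problem: a conv block trained in isolation.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(),
        nn.MaxPool2d(2),
    )

class AuxHead(nn.Module):
    # Auxiliary linear classifier on spatially pooled features.
    def __init__(self, ch, n_classes):
        super().__init__()
        self.fc = nn.Linear(ch, n_classes)

    def forward(self, h):
        return self.fc(F.adaptive_avg_pool2d(h, 1).flatten(1))

def train_greedy(loader, widths=(3, 64, 128, 256), n_classes=10, epochs=1):
    blocks = []
    for k in range(len(widths) - 1):
        block = make_block(widths[k], widths[k + 1])
        head = AuxHead(widths[k + 1], n_classes)
        opt = torch.optim.SGD(
            list(block.parameters()) + list(head.parameters()), lr=0.1)
        for _ in range(epochs):
            for x, y in loader:
                with torch.no_grad():      # earlier blocks are frozen
                    for b in blocks:
                        x = b(x)
                loss = F.cross_entropy(head(block(x)), y)
                opt.zero_grad()
                loss.backward()            # gradients stay local to block k
                opt.step()
        blocks.append(block.eval())        # freeze, then stack the next layer
    return nn.Sequential(*blocks), head    # last aux head doubles as classifier
```

The key property is that `loss.backward()` never crosses block boundaries, so each training problem stays shallow while the stacked features become progressively deeper.
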

Cited by 11 publications (22 citation statements) | References 37 publications
“…[18] used the layer-wise method to train residual blocks in ResNet sequentially, then refined the network with standard end-to-end training. [19] studied the progressive separability of layer-wise trained supervised neural networks and demonstrated that Greedy Layer-wise Learning (GLL) can scale to large-scale datasets like ImageNet. Other attempts at supervised layer-wise learning involve a synthetic gradient [20] and a layer-wise loss that combines a local classifier loss with a similarity-matching loss [21].…”
Section: Related Work (mentioning)
confidence: 99%
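
The two-phase recipe attributed to [18] (sequential residual-block training followed by end-to-end refinement) might look like the sketch below; the `ResBlock` shape, the `stem`, and the hyperparameters are assumptions for illustration, not the cited implementation.

```python
# Sketch of the two-phase recipe attributed to [18], assuming PyTorch:
# train residual blocks sequentially, then refine end-to-end.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x):
        return F.relu(x + self.conv2(F.relu(self.conv1(x))))

def aux_head(ch, n_classes):
    return nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                         nn.Linear(ch, n_classes))

def train_then_refine(loader, stem, n_blocks=4, ch=64, n_classes=10):
    blocks = []
    for _ in range(n_blocks):              # phase 1: one block at a time
        blk, head = ResBlock(ch), aux_head(ch, n_classes)
        opt = torch.optim.SGD(
            list(blk.parameters()) + list(head.parameters()), lr=0.1)
        for x, y in loader:
            with torch.no_grad():          # frozen stem + earlier blocks
                h = stem(x)
                for b in blocks:
                    h = b(h)
            loss = F.cross_entropy(head(blk(h)), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
        blocks.append(blk)
    # reuse the last auxiliary head as the final classifier
    net = nn.Sequential(stem, *blocks, head)
    opt = torch.optim.SGD(net.parameters(), lr=0.01)
    for x, y in loader:                    # phase 2: end-to-end refinement
        loss = F.cross_entropy(net(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return net
```
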
“…Greedy and Randomized Layer-wise Learning. We adapted the supervised Greedy Layer-wise Learning (GLL) [19] method to self-supervised learning by training convolutional layers sequentially with auxiliary heads and a self-supervised loss, as shown in Figure 1. The base encoders are trained layer by layer.…”
Section: Layer-wise Learning With Random Feedback (mentioning)
confidence: 99%
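
The adaptation this quote describes (sequential training of conv blocks with auxiliary heads under a self-supervised objective) might look like the sketch below, using a SimCLR-style contrastive loss as a stand-in; the loss choice and projection head are assumptions, not the cited paper's code.

```python
# Sketch of sequential self-supervised layer-wise training, assuming PyTorch
# and a SimCLR-style contrastive objective as a stand-in loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    # Contrastive loss between two augmented views of the same batch.
    z = F.normalize(torch.cat([z1, z2]), dim=1)
    sim = z @ z.t() / tau
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool)
    sim = sim.masked_fill(mask, float('-inf'))   # exclude self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

def train_layerwise_ssl(loader, blocks, widths, proj_dim=128):
    # `loader` yields two views (x1, x2); `blocks[k]` outputs widths[k] channels.
    trained = []
    for block, ch in zip(blocks, widths):
        proj = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                             nn.Linear(ch, proj_dim))   # throwaway aux head
        opt = torch.optim.Adam(
            list(block.parameters()) + list(proj.parameters()), lr=1e-3)
        for x1, x2 in loader:
            with torch.no_grad():          # earlier blocks stay frozen
                for b in trained:
                    x1, x2 = b(x1), b(x2)
            loss = nt_xent(proj(block(x1)), proj(block(x2)))
            opt.zero_grad()
            loss.backward()
            opt.step()
        trained.append(block.eval())
    return nn.Sequential(*trained)
```
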
“…We survey three categories of BP literature: (i) better hardware implementations of BP [15,16,31,11,32,25], (ii) workarounds to approximate BP [33,7,10], and (iii) biologically inspired algorithms. Biologically inspired algorithms can further be segregated into four types: (i) those inspired by biological observations [29,7,26,17], which try to approximate BP with the intention of resolving its biological implausibility; (ii) propagation of an alternative to error [19,21]; (iii) leveraging local errors, the power of single-layer networks, and layer-wise pre-training to approximate BP [24,23,3]; and (iv) resolving the locking problem using decoupling [14,6,12,1,20] and its variants [27,8,22,4]. We were deeply motivated by (ii), (iii), and (iv) in coming up with the idea of 'front contributions': propagating something other than error, the idea of a single-layer network, and decoupling collectively inspire 'front contributions'.…”
Section: Introduction and Related Work (mentioning)
confidence: 99%
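
Of the mechanisms this survey lists, the synthetic-gradient/decoupling idea in (ii) and (iv) is the least obvious from a one-line description: a small local model predicts the gradient at a layer's output, so the layer can update without waiting for downstream backpropagation. A minimal DNI-style sketch, with all module shapes assumed for illustration:

```python
# Sketch of the synthetic-gradient / decoupling idea, assuming PyTorch;
# module shapes and the training loop are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

layer = nn.Linear(784, 256)      # layer trained from a *predicted* gradient
sg = nn.Linear(256, 256)         # synthetic-gradient model: h -> dL/dh estimate
head = nn.Linear(256, 10)        # downstream network producing the true loss
opt_layer = torch.optim.SGD(layer.parameters(), lr=0.1)
opt_sg = torch.optim.SGD(sg.parameters(), lr=0.01)
opt_head = torch.optim.SGD(head.parameters(), lr=0.1)

def decoupled_step(x, y):
    h = F.relu(layer(x))
    g_hat = sg(h.detach())                  # predict the gradient at h
    opt_layer.zero_grad()
    h.backward(gradient=g_hat.detach())     # update the layer immediately,
    opt_layer.step()                        # without downstream backprop

    h2 = h.detach().requires_grad_(True)    # downstream runs as usual
    loss = F.cross_entropy(head(h2), y)
    opt_head.zero_grad()
    loss.backward()
    opt_head.step()

    opt_sg.zero_grad()                      # fit predictor to the true gradient
    F.mse_loss(g_hat, h2.grad.detach()).backward()
    opt_sg.step()
    return loss.item()
```
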
“…using gradient descent, where F denotes the network function space. Studying inductive bias in the context of autoencoders is relevant since (1) components of convolutional autoencoders are building blocks of many CNNs; (2) layer-wise pre-training using autoencoders is a standard technique to initialize individual layers of CNNs to improve training [2,5,8]; and (3) autoencoder architectures are used in many image-to-image tasks such as image segmentation or inpainting [25]. Furthermore, the inductive bias that we characterize in autoencoders may apply to more general architectures.…”
Section: Introduction (mentioning)
confidence: 99%
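
The layer-wise autoencoder pre-training mentioned in (2) trains each layer to reconstruct its own input, then keeps the encoder as an initialization. A generic stacked-autoencoder sketch follows; the architecture and loss are illustrative assumptions, not taken from [2,5,8].

```python
# Generic sketch of layer-wise autoencoder pre-training, assuming PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

def pretrain_stacked(loader, widths=(3, 32, 64), epochs=1):
    encoders = []
    for k in range(len(widths) - 1):
        enc = nn.Conv2d(widths[k], widths[k + 1], 3, padding=1)
        dec = nn.ConvTranspose2d(widths[k + 1], widths[k], 3, padding=1)
        opt = torch.optim.Adam(
            list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
        for _ in range(epochs):
            for x, _ in loader:
                with torch.no_grad():      # re-encode with frozen earlier layers
                    for e in encoders:
                        x = F.relu(e(x))
                recon = dec(F.relu(enc(x)))     # reconstruct this layer's input
                loss = F.mse_loss(recon, x)
                opt.zero_grad()
                loss.backward()
                opt.step()
        encoders.append(enc)               # keep the encoder, drop the decoder
    # `encoders` now initializes the first layers of a CNN before fine-tuning.
    return encoders
```
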