Distributions of data or sensory stimuli often enjoy underlying invariances. How and to what extent those symmetries are captured by unsupervised learning methods is a relevant question in machine learning and in computational neuroscience. We study here, through a combination of numerical and analytical tools, the learning dynamics of Restricted Boltzmann Machines (RBMs), a neural network paradigm for representation learning. As learning proceeds from a random configuration of the network weights, we show the existence of, and characterize, a symmetry-breaking phenomenon in which the latent variables acquire receptive fields focusing on limited parts of the invariant manifold supporting the data. The symmetry is restored at large learning times through the diffusion of the receptive fields over the invariant manifold; hence, the RBM effectively spans a continuous attractor in the space of network weights. This symmetry-breaking phenomenon takes place only if the amount of data available for training exceeds some critical value, which depends on the network size and on the intensity of symmetry-induced correlations in the data; below this 'retarded-learning' threshold, the network weights are essentially noisy and overfit the data.
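To fix notation for the model class studied here, the following is a minimal sketch of a binary RBM trained with one-step contrastive divergence (CD-1); the toy data, network sizes, and learning rate are illustrative choices, not those used in this work.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(W, a, b, v0, lr=0.05):
    """One CD-1 update for a binary RBM.

    W: (n_visible, n_hidden) weight matrix; a, b: visible/hidden biases.
    v0: batch of binary visible configurations, shape (batch, n_visible).
    """
    # Positive phase: hidden activations driven by the data.
    ph0 = sigmoid(v0 @ W + b)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step down to the visible layer and back up.
    pv1 = sigmoid(h0 @ W.T + a)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + b)
    # Gradient approximation: data correlations minus model correlations.
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / v0.shape[0]
    a += lr * (v0 - v1).mean(axis=0)
    b += lr * (ph0 - ph1).mean(axis=0)
    return W, a, b

# Toy run: 20 visible units, 5 hidden units, random binary data.
n_v, n_h = 20, 5
W = 0.01 * rng.standard_normal((n_v, n_h))
a, b = np.zeros(n_v), np.zeros(n_h)
data = (rng.random((64, n_v)) < 0.5).astype(float)
for _ in range(100):
    W, a, b = cd1_step(W, a, b, data)
```

Each column of W plays the role of a latent variable's receptive field over the visible units; the symmetry-breaking phenomenon described above concerns how these columns organize when the data distribution is invariant under a continuous group.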
I. INTRODUCTION

Many high-dimensional inputs or data enjoy various kinds of low-dimensional invariances, which are at the basis of the so-called manifold hypothesis [1]. For instance, the pictures of somebody's face are related to each other through a set of continuous symmetries corresponding to the degrees of freedom characterizing the relative position of the camera (rotations, translations, changes of scale) as well as the internal deformations of the face (controlled by muscles). While well-understood symmetries can be explicitly taken care of through adequate procedures, e.g. convolutional networks, not all invariances may be known a priori. An interesting question is therefore whether and how these residual symmetries affect the representations of the data achieved by learning models.

This question does not arise solely in the context of machine learning, but is also of interest in computational neuroscience, where it is of crucial importance to understand how the statistical structure of input stimuli, be they visual, olfactory, auditory, tactile, ..., shapes their encoding by sensory brain areas and their processing by higher cortical regions. Information theory provides a mathematical framework to answer this question [2], and was applied, in the case of linear models of neurons, to a variety of situations, including the prediction of the receptive fields of retinal ganglion cells [3], the determination of cone fractions in the human retina [4], and the efficient representation of odor-variable environments [5]. In the case of natural images, which enjoy approximate translational and rotational invariances, non-linear learning rules resulting from adequate modifications of Oja's dynamics [6] or sparse-representation learning procedures [7] produce local edge detectors, such as do independent co...