A method to reduce the rejection rate in Monte Carlo Markov chains

Baldassi, Carlo

doi:10.1088/1742-5468/aa5335

Cited by 6 publications

(8 citation statements)

References 20 publications


“…2, we report the analytical predictions for the average classical component of the energy of the quantum model as a function of the transverse field […]. We compare the results with the outcome of extensive simulations performed with the reduced-rejection-rate (RRR) Monte Carlo method (37), in which […] is initialized at 2.5 and gradually brought down to 0 in regular small steps, at constant temperature, and fixing the total simulation time to […] (as to keep constant the number of Monte Carlo sweeps when varying […] and […]). Additional details are reported in Materials and Methods and SI Appendix.…”

Section: Phase Diagram: Analytical and Numerical Results

mentioning

confidence: 99%


Efficiency of quantum vs. classical annealing in nonconvex learning problems

Baldassi

Zecchina

2018

Proc. Natl. Acad. Sci. U.S.A.

Self Cite


Significance: Quantum annealers are physical quantum devices designed to solve optimization problems by finding low-energy configurations of an appropriate energy function by exploiting cooperative tunneling effects to escape local minima. Classical annealers use thermal fluctuations for the same computational purpose, and Markov chains based on this principle are among the most widespread optimization techniques. The fundamental mechanism underlying quantum annealing consists of exploiting a controllable quantum perturbation to generate tunneling processes. The computational potentialities of quantum annealers are still under debate, since few ad hoc positive results are known. Here, we identify a wide class of large-scale nonconvex optimization problems for which quantum annealing is efficient while classical annealing gets stuck. These problems are of central interest to machine learning.


“…All SQA simulations were performed by using the RRR Monte Carlo method (37). We fixed the total number of spin flip attempts at […] and followed a linear protocol (divided in […] steps) for the annealing of […].…”

Section: Methods

mentioning

confidence: 99%
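The quoted passages describe lowering the transverse field from 2.5 to 0 in regular steps while holding the total number of spin-flip attempts fixed. A minimal sketch of such a linear schedule follows. The step count and the attempt budget are elided in the quotes, so the values 10 and 1,000,000 below are purely illustrative, and the function names are hypothetical. Whether any attempts are spent at the final field value is also an assumption of this sketch.

```python
def linear_schedule(start, stop, n_steps):
    """Return n_steps + 1 evenly spaced values from start to stop, inclusive."""
    if n_steps < 1:
        raise ValueError("n_steps must be at least 1")
    delta = (stop - start) / n_steps
    values = [start + i * delta for i in range(n_steps)]
    values.append(stop)  # pin the endpoint to avoid floating-point drift
    return values


def attempts_per_value(total_attempts, n_values):
    """Split a fixed budget of spin-flip attempts as evenly as possible."""
    base, extra = divmod(total_attempts, n_values)
    # The first `extra` field values receive one additional attempt each.
    return [base + (1 if i < extra else 0) for i in range(n_values)]


# Illustrative values only: 10 steps and 1,000,000 attempts are assumptions.
fields = linear_schedule(2.5, 0.0, 10)
budget = attempts_per_value(1_000_000, len(fields))
```

Splitting the budget this way keeps the total simulation effort constant when the number of steps changes, which is the property the first quote emphasizes.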


“…For the numerical results, we have used simulated annealing on a system with K = 32 (K = 33) for the ReLU (sign) activations (respectively), and N = K² · 10³. We have simulated a system of y interacting replicas that is able to sample from the local-entropic measure [6] with the RRR Monte Carlo method [21], ensuring that the annealing process was sufficiently slow such that at the end of the simulation all replicas were solutions, and controlling the interaction such that the average overlap between replicas was equal to q₁ within a tolerance of 0.01. The results were averaged over 20 samples.…”

mentioning

confidence: 99%

Properties of the Geometry of Solutions and Capacity of Multilayer Neural Networks with Rectified Linear Unit Activations

2019

Self Cite


Rectified Linear Units (ReLU) have become the main model for the neural units in current deep learning systems. This choice has been originally suggested as a way to compensate for the so-called vanishing gradient problem which can undercut stochastic gradient descent (SGD) learning in networks composed of multiple layers. Here we provide analytical results on the effects of ReLUs on the capacity and on the geometrical landscape of the solution space in two-layer neural networks with either binary or real-valued weights. We study the problem of storing an extensive number of random patterns and find that, quite unexpectedly, the capacity of the network remains finite as the number of neurons in the hidden layer increases, at odds with the case of threshold units in which the capacity diverges. Possibly more important, a large deviation approach allows us to find that the geometrical landscape of the solution space has a peculiar structure: While the majority of solutions are close in distance but still isolated, there exist rare regions of solutions which are much more dense than the similar ones in the case of threshold units. These solutions are robust to perturbations of the weights and can tolerate large perturbations of the inputs. The analytical results are corroborated by numerical findings.

arXiv:1907.07578v3 [cond-mat.dis-nn]


“…We did in fact generalize and improve this scheme after the preparation of this manuscript, see [38].…”

mentioning

confidence: 99%

Unreasonable effectiveness of learning neural networks: From accessible states and robust ensembles to basic algorithmic schemes

Baldassi

Borgs

Chayes

et al. 2016

Proc. Natl. Acad. Sci. U.S.A.

Self Cite


In artificial neural networks, learning from data is a computationally demanding task in which a large number of connection weights are iteratively tuned through stochastic-gradient-based heuristic processes over a cost function. It is not well understood how learning occurs in these systems, in particular how they avoid getting trapped in configurations with poor computational performance. Here, we study the difficult case of networks with discrete weights, where the optimization landscape is very rough even for simple architectures, and provide theoretical and numerical evidence of the existence of rare (but extremely dense and accessible) regions of configurations in the network weight space. We define a measure, the robust ensemble (RE), which suppresses trapping by isolated configurations and amplifies the role of these dense regions. We analytically compute the RE in some exactly solvable models and also provide a general algorithmic scheme that is straightforward to implement: define a cost function given by a sum of a finite number of replicas of the original cost function, with a constraint centering the replicas around a driving assignment. To illustrate this, we derive several powerful algorithms, ranging from Markov Chains to message passing to gradient descent processes, where the algorithms target the robust dense states, resulting in substantial improvements in performance. The weak dependence on the number of precision bits of the weights leads us to conjecture that very similar reasoning applies to more conventional neural networks. Analogous algorithmic schemes can also be applied to other optimization problems.

machine learning | neural networks | statistical physics | optimization

There is increasing evidence that artificial neural networks perform exceptionally well in complex recognition tasks (1). Despite huge numbers of parameters and strong nonlinearities, learning often occurs without getting trapped in local minima with poor prediction performance (2). The remarkable output of these models has created unprecedented opportunities for machine learning in a host of applications. However, these practical successes have been guided by intuition and experiments, whereas obtaining a complete theoretical understanding of why these techniques work seems currently out of reach, due to the inherent complexity of the problem. In other words, in practical applications, large and complex architectures are trained on big and rich datasets using an array of heuristic improvements over basic stochastic gradient descent (SGD). These heuristic enhancements over a stochastic process have the general purpose of improving the convergence and robustness properties (and therefore the generalization properties) of the networks, with respect to what would be achieved with a pure gradient descent (GD) on a cost function.

There are many parallels between the studies of algorithmic stochastic processes and out-of-equilibrium processes in complex systems. Examples include jamming processes in physics, local search alg...
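The abstract describes a replicated cost: a sum of copies of the original cost function, with the replicas centered around a driving assignment. The sketch below illustrates that idea under stated assumptions and is not the paper's implementation. In particular, the centering is modeled here as a soft squared-distance penalty with a hypothetical `coupling` strength, and the names `replicated_cost`, `center`, and `coupling` are illustrative.

```python
def replicated_cost(cost, replicas, center, coupling):
    """Sum the cost over all replicas and add a penalty pulling each toward center.

    Assumption: the centering constraint is represented by a squared-distance
    penalty weighted by `coupling`; other distance choices are possible.
    """
    total = 0.0
    for w in replicas:
        dist2 = sum((wi - ci) ** 2 for wi, ci in zip(w, center))
        total += cost(w) + coupling * dist2
    return total


def toy_cost(w):
    """Toy quadratic cost, used only to exercise the function."""
    return sum(x * x for x in w)


example = replicated_cost(
    toy_cost,
    replicas=[[1.0, 0.0], [0.0, 1.0]],
    center=[0.5, 0.5],
    coupling=2.0,
)
```

Here each replica contributes a cost of 1.0 plus a penalty of 2.0 × 0.5, so `example` evaluates to 4.0. Raising `coupling` pulls the replicas closer together, which is how such a scheme can favor dense regions over isolated configurations.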
