The Fisher information matrix (FIM) is a fundamental quantity for characterizing a stochastic model, including deep neural networks (DNNs). The present study reveals novel statistics of the FIM that are universal across a wide class of DNNs. To this end, we use random weights and the large-width limit, which enable us to apply mean-field theory. We investigate the asymptotic statistics of the FIM's eigenvalues and reveal that most of them are close to zero while the maximum eigenvalue takes a huge value. Because the local geometry of the parameter space is determined by the FIM, the landscape is locally flat in most dimensions but strongly distorted in others. Moreover, we demonstrate the potential use of the derived statistics in learning strategies. First, the small eigenvalues that induce flatness can be connected to a norm-based capacity measure of generalization ability. Second, the maximum eigenvalue that induces the distortion enables us to quantitatively estimate an appropriately sized learning rate for gradient methods to converge.
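As an illustration of the claimed spectrum (not the paper's derivation), the sketch below computes the empirical FIM of a one-hidden-layer tanh network with random weights and inspects its eigenvalues. The architecture, the sizes, and the Gaussian-output (squared-loss) model are assumptions chosen only for this example.

```python
# Empirical FIM eigenvalue spectrum of a random-weight one-hidden-layer net.
# Minimal sketch: it only illustrates the claimed shape of the spectrum
# (most eigenvalues near zero, one large outlier).
import numpy as np

rng = np.random.default_rng(0)
d, m, n_samples = 20, 30, 2000   # input dim, hidden width, number of inputs (assumed values)

# Random (untrained) weights, scaled as in common mean-field initializations.
W1 = rng.normal(0.0, 1.0 / np.sqrt(d), size=(m, d))
w2 = rng.normal(0.0, 1.0 / np.sqrt(m), size=m)

def grad_output(x):
    """Gradient of the scalar output y = w2 . tanh(W1 x) w.r.t. all parameters."""
    a = np.tanh(W1 @ x)
    dy_dw2 = a                                             # shape (m,)
    dy_dW1 = (w2 * (1.0 - a ** 2))[:, None] * x[None, :]   # shape (m, d)
    return np.concatenate([dy_dW1.ravel(), dy_dw2])

# Empirical FIM of a Gaussian output model with unit variance:
# F = (1/N) * sum_n grad(x_n) grad(x_n)^T
X = rng.normal(size=(n_samples, d))
G = np.stack([grad_output(x) for x in X])   # (N, P) with P = m*d + m
F = G.T @ G / n_samples                     # (P, P)

eig = np.linalg.eigvalsh(F)
print("largest eigenvalue        :", eig[-1])
print("mean eigenvalue           :", eig.mean())
print("fraction below 1e-3 * max :", np.mean(eig < 1e-3 * eig[-1]))
```

Running this prints a mean eigenvalue far below the largest one and a large fraction of eigenvalues tiny relative to the maximum, which is the long-tailed picture described in the abstract.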
Two geometrical structures have been extensively studied for a manifold of probability distributions. One is based on the Fisher information metric, which is invariant under reversible transformations of random variables, while the other is based on the Wasserstein distance of optimal transportation, which reflects the structure of the distance between underlying random variables. Here, we propose a new information-geometrical theory that provides a unified framework connecting the Wasserstein distance and the Kullback-Leibler (KL) divergence. We primarily consider a discrete case consisting of n elements and study the geometry of the probability simplex S_{n-1}, which is the set of all probability distributions over n elements. The Wasserstein distance is introduced in S_{n-1} by the optimal transportation of commodities from distribution p to distribution q, where p, q ∈ S_{n-1}. We relax the optimal transportation by using entropy, as introduced by Cuturi. The optimal solution is called the entropy-relaxed stochastic transportation plan. The entropy-relaxed optimal cost C(p, q) is computationally much less demanding than the original Wasserstein distance but does not define a distance because it is not minimized at p = q. To define a proper divergence while retaining the computational advantage, we first introduce a divergence function on the manifold S_{n-1} × S_{n-1} composed of all optimal transportation plans. We fully explore the information geometry of the manifold of optimal transportation plans and subsequently construct a new one-parameter family of divergences in S_{n-1} that are related to both the Wasserstein distance and the KL divergence.
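As a rough illustration of the entropy-relaxed transportation described above (not the paper's construction), the sketch below runs Sinkhorn-style iterations to obtain the relaxed plan and its transport cost on a small simplex. The ground cost |i − j|, the regularization strength, and the example distributions are assumptions made only for the demonstration; note how the self-cost C(p, p) does not vanish, which is why C alone cannot serve as a divergence.

```python
# Entropy-relaxed optimal transport via Sinkhorn iterations (Cuturi-style).
# Minimal sketch: cheap to compute, but not zero at p = q.
import numpy as np

def sinkhorn_cost(p, q, M, lam=1.0, n_iter=500):
    """Transport part <P*, M> of the entropy-relaxed cost between p and q."""
    K = np.exp(-M / lam)
    u = np.ones_like(p)
    for _ in range(n_iter):
        v = q / (K.T @ u)
        u = p / (K @ v)
    P = u[:, None] * K * v[None, :]   # entropy-relaxed optimal plan
    return np.sum(P * M)

n = 8
M = np.abs(np.subtract.outer(np.arange(n), np.arange(n))).astype(float)  # ground cost |i - j|
p = np.ones(n) / n
q = np.array([0.30, 0.25, 0.15, 0.10, 0.08, 0.06, 0.04, 0.02])

print("C(p, p) =", sinkhorn_cost(p, p, M))   # strictly positive, unlike a true distance
print("C(p, q) =", sinkhorn_cost(p, q, M))
```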
The Fisher information matrix (FIM) is a fundamental quantity for characterizing a stochastic model, including deep neural networks (DNNs). The present study reveals novel statistics of the FIM that are universal across a wide class of DNNs. To this end, we use random weights and the large-width limit, which enable us to apply mean-field theory. We investigate the asymptotic statistics of the FIM's eigenvalues and reveal that most of them are close to zero while the maximum eigenvalue takes a huge value, implying that the eigenvalue distribution has a long tail. Because the landscape of the parameter space is defined by the FIM, it is locally flat in most dimensions but strongly distorted in others. We also demonstrate the potential use of the derived statistics through two exercises. First, the small eigenvalues that induce flatness can be connected to a norm-based capacity measure of generalization ability. Second, the maximum eigenvalue that induces the distortion enables us to quantitatively estimate an appropriately sized learning rate for gradient methods to converge.
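The learning-rate estimate can be illustrated with the standard quadratic-model argument: for a loss whose curvature matrix is F, plain gradient descent is stable only when the learning rate stays below roughly 2/λ_max(F). The sketch below uses a synthetic long-tailed spectrum, not the paper's actual estimate, to check this numerically.

```python
# Why the largest FIM eigenvalue caps the usable learning rate: for a quadratic
# loss L(theta) = 0.5 * theta^T F theta, gradient descent converges only if
# eta < 2 / lambda_max(F). Synthetic spectrum assumed for the example.
import numpy as np

rng = np.random.default_rng(1)
P = 200
eigs = np.concatenate([rng.uniform(1e-4, 1e-2, P - 1), [50.0]])  # many tiny eigenvalues, one huge
Q, _ = np.linalg.qr(rng.normal(size=(P, P)))
F = (Q * eigs) @ Q.T                        # symmetric PSD with prescribed eigenvalues

eta_crit = 2.0 / eigs.max()

def run_gd(eta, n_steps=200):
    theta = rng.normal(size=P)
    for _ in range(n_steps):
        theta = theta - eta * (F @ theta)   # gradient of 0.5 * theta^T F theta
    return 0.5 * theta @ F @ theta          # final loss

print("critical eta = 2 / lambda_max =", eta_crit)
print("loss with eta = 0.9 * eta_crit:", run_gd(0.9 * eta_crit))  # stays bounded
print("loss with eta = 1.1 * eta_crit:", run_gd(1.1 * eta_crit))  # blows up
```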
We propose a new divergence on the manifold of probability distributions, building on the entropic regularization of optimal transportation problems. As Cuturi (2013) showed, regularizing the optimal transport problem with an entropic term brings several computational benefits. However, because of that regularization, the resulting approximation of the optimal transport cost does not define a proper distance or divergence between probability distributions. We recently introduced a family of divergences connecting the Wasserstein distance and the Kullback-Leibler divergence from an information-geometry point of view (see Amari, Karakida, & Oizumi, 2018). However, that proposal did not retain key intuitive aspects of the Wasserstein geometry, such as translation invariance, which plays a key role in the more general problem of computing optimal transport barycenters. The divergence we propose in this work retains such properties and admits an intuitive interpretation.
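The divergence constructed in the paper is not reproduced here, but the following sketch illustrates why a correction to the regularized cost is needed at all: one well-known recipe from the optimal-transport literature (often called the Sinkhorn divergence) subtracts the two self-costs so that the result vanishes at p = q. The helper names, ground cost, and example distributions are assumptions for the illustration only, and this is not the divergence proposed in the work described above.

```python
# Debiasing the entropy-regularized cost: S(p, q) = C(p, q) - 0.5*C(p, p) - 0.5*C(q, q).
# Illustrative sketch only; not the divergence constructed in the paper.
import numpy as np

def sinkhorn_cost(p, q, M, lam=1.0, n_iter=500):
    K = np.exp(-M / lam)
    u = np.ones_like(p)
    for _ in range(n_iter):
        v = q / (K.T @ u)
        u = p / (K @ v)
    return np.sum((u[:, None] * K * v[None, :]) * M)

def debiased_divergence(p, q, M, lam=1.0):
    # Subtracting the self-costs forces the value to vanish at p = q.
    return (sinkhorn_cost(p, q, M, lam)
            - 0.5 * sinkhorn_cost(p, p, M, lam)
            - 0.5 * sinkhorn_cost(q, q, M, lam))

n = 8
M = np.abs(np.subtract.outer(np.arange(n), np.arange(n))).astype(float)
p = np.ones(n) / n
q = np.array([0.30, 0.25, 0.15, 0.10, 0.08, 0.06, 0.04, 0.02])

print("S(p, p) =", debiased_divergence(p, p, M))   # exactly 0 by construction
print("S(p, q) =", debiased_divergence(p, q, M))   # positive for these p, q
```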