In this theory paper, we investigate training deep neural networks (DNNs) for classification via minimizing the information bottleneck (IB) functional. We show that the resulting optimization problem suffers from two severe issues: First, for deterministic DNNs, either the IB functional is infinite for almost all values of the network parameters, making the optimization problem ill-posed, or it is piecewise constant, hence not admitting gradient-based optimization methods. Second, the invariance of the IB functional under bijections prevents it from capturing properties of the learned representation that are desirable for classification, such as robustness and simplicity. We argue that these issues are partly resolved for stochastic DNNs, DNNs that include a (hard or soft) decision rule, or by replacing the IB functional with related, but better-behaved, cost functions. We conclude that recent successes reported for training DNNs using the IB framework must be attributed to such solutions. As a side effect, our results indicate limitations of the IB framework for the analysis of DNNs. We also note that, rather than trying to repair the inherent problems of the IB functional, a better approach may be to design regularizers on the latent representation that enforce the desired properties directly.

Index Terms: deep learning, information bottleneck, representation learning, regularization, classification, neural networks, stochastic neural networks.
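For reference, the IB functional itself is not stated in the abstract; in the standard formulation (assumed here, the paper's exact convention may differ), the representation $T$ of the input $X$ is chosen to trade off compression against prediction of the class label $Y$:
\[
\mathcal{L}_{\mathrm{IB}}(\theta) \;=\; I(X; T_\theta) \;-\; \beta\, I(T_\theta; Y), \qquad \beta > 0,
\]
where $T_\theta$ denotes the mapping implemented by the DNN with parameters $\theta$. For a deterministic DNN with a continuous input $X$, $I(X;T_\theta)$ is typically infinite, whereas for a discrete input it is a piecewise constant function of $\theta$; this is the dichotomy referred to in the abstract.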
This paper takes a rate-distortion approach to understanding the information-theoretic laws governing cache-aided communications systems. Specifically, we characterise the optimal tradeoffs between the delivery rate, cache capacity and reconstruction distortions for a single-user problem and some special cases of a two-user problem. Our analysis considers discrete memoryless sources, expected- and excess-distortion constraints, and separable and f-separable distortion functions. We also establish a strong converse for separable distortion functions, and we show that lossy versions of common information (Gács-Körner and Wyner) play an important role in caching. Finally, we illustrate and explicitly evaluate these laws for multivariate Gaussian sources and binary symmetric sources.
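The distortion classes mentioned above are not defined in the abstract; a common convention (assumed here) is that, for a per-letter distortion measure $d$ and a continuous, strictly increasing function $f$,
\[
d_n(x^n,\hat{x}^n) \;=\; \frac{1}{n}\sum_{i=1}^{n} d(x_i,\hat{x}_i) \quad \text{(separable)}, \qquad
d_n^{(f)}(x^n,\hat{x}^n) \;=\; f^{-1}\!\left(\frac{1}{n}\sum_{i=1}^{n} f\big(d(x_i,\hat{x}_i)\big)\right) \quad \text{(f-separable)},
\]
so that the separable case is recovered by choosing $f$ as the identity.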
Abstract. Sum-product networks make it possible to model complex variable interactions while still granting efficient inference. However, the learning algorithms proposed so far are explicitly or implicitly restricted to the image domain, either by assuming variable neighborhood or by assuming that dependent variables are related by their values over the training set. In this paper, we introduce a novel algorithm that learns the structure and parameters of sum-product networks in a greedy bottom-up manner. Our algorithm successively merges probabilistic models of small variable scope into larger and more complex models. These merges are guided by a statistical dependence test, and parameters are learned using a maximum mutual information principle. In experiments, we show that our method competes well with existing learning algorithms for sum-product networks on the task of reconstructing covered image regions, and outperforms them when neither neighborhood nor relation of variables by value can be assumed.
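As an illustration only, and not the authors' algorithm, the following Python sketch conveys the flavour of greedy bottom-up merging of variable scopes guided by a pairwise statistical dependence test; the chi-squared test, the significance level alpha, and all function names are assumptions made for this sketch, and no sum-product network or parameter learning is constructed here.

import numpy as np
from scipy.stats import chi2_contingency
from itertools import combinations

def dependence(data, i, j):
    """p-value of a chi-squared independence test between columns i and j
    (data is assumed to hold discrete, non-negative integer values)."""
    table = np.zeros((int(data[:, i].max()) + 1, int(data[:, j].max()) + 1))
    for a, b in zip(data[:, i], data[:, j]):
        table[int(a), int(b)] += 1
    _, p, _, _ = chi2_contingency(table + 1e-9)  # small offset avoids zero rows/columns
    return p

def greedy_bottom_up_merge(data, alpha=0.05):
    """Start with one scope per variable and greedily merge the pair of scopes
    that contains the most dependent variable pair (smallest p-value)."""
    scopes = [{v} for v in range(data.shape[1])]
    merges = []
    while len(scopes) > 1:
        best = None
        for (a, sa), (b, sb) in combinations(enumerate(scopes), 2):
            p = min(dependence(data, i, j) for i in sa for j in sb)
            if best is None or p < best[0]:
                best = (p, a, b)
        p, a, b = best
        if p > alpha:          # no significantly dependent pair left to merge
            break
        merges.append((scopes[a], scopes[b]))
        scopes[a] = scopes[a] | scopes[b]
        del scopes[b]
    return scopes, merges

In the actual method, each merge would additionally combine the corresponding probabilistic models into sum and product nodes, with parameters learned under the maximum mutual information principle described in the abstract.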
Finite-precision approximations of discrete probability distributions are considered, applicable to distribution synthesis, e.g., probabilistic shaping. Two algorithms are presented that find the optimal $M$-type approximation $Q$ of a distribution $P$ in terms of the variational distance $\|Q-P\|_1$ and the informational divergence $\mathbb{D}(Q\|P)$, respectively. Bounds on the approximation errors are derived and shown to be asymptotically tight. Several examples illustrate that the variational-distance-optimal approximation can be quite different from the informational-divergence-optimal approximation.

Comment: Submitted to the IEEE Transactions on Information Theory.
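As a hedged sketch (the function name and exact procedure are assumptions, not necessarily the paper's algorithm), the variational-distance-optimal $M$-type approximation, i.e., the $Q$ whose probabilities are integer multiples of $1/M$, can be obtained by largest-remainder rounding in Python:

import numpy as np

def m_type_approx_variational(p, M):
    """Return q with q_i = k_i / M (k_i integers summing to M) minimizing ||q - p||_1.
    Largest-remainder rounding; a sketch, not necessarily the paper's exact algorithm."""
    p = np.asarray(p, dtype=float)
    k = np.floor(M * p).astype(int)      # start from the floor of M * p_i
    remainder = M * p - k                # fractional parts
    deficit = M - k.sum()                # units of 1/M still to assign
    for idx in np.argsort(-remainder)[:deficit]:
        k[idx] += 1                      # give one unit each to the largest remainders
    return k / M

# small usage example
p = [0.42, 0.33, 0.15, 0.10]
q = m_type_approx_variational(p, M=8)
print(q, np.abs(q - np.asarray(p)).sum())  # M-type approximation and its L1 error

The divergence-optimal approximation generally requires a different assignment of the remaining probability mass, which is why the two approximations can differ noticeably, as the abstract points out.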