A grand challenge in representation learning is the development of computational algorithms that learn the different explanatory factors of variation behind high-dimensional data. Representation models (usually referred to as encoders) are often trained to optimize performance on the training data, when the real objective is to generalize well to other (unseen) data. The first part of this chapter is devoted to providing an overview of and introduction to fundamental concepts in statistical learning theory and the Information Bottleneck principle. It serves as a mathematical basis for the technical results given in the second part, in which an upper bound to the generalization gap corresponding to the cross-entropy risk is derived. When this bound, used as a penalty term with a suitable multiplier, and the empirical cross-entropy risk are minimized jointly, the problem is equivalent to optimizing the Information Bottleneck objective with respect to the empirical data distribution. This result provides an interesting connection between mutual information and generalization, and helps to explain why noise injection during the training phase can improve the generalization ability of encoder models and enforce invariances in the resulting representations.
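For concreteness, the Information Bottleneck objective referred to above is commonly stated in the following Lagrangian form (the precise penalty and multiplier convention used in the technical part of the chapter may differ; this is only a reference sketch):
\[
\min_{p(z \mid x)} \; I(X;Z) \;-\; \beta\, I(Z;Y), \qquad \beta > 0,
\]
where \(X\) denotes the input data, \(Y\) the target variable, \(Z\) the representation produced by the encoder \(p(z \mid x)\), and \(\beta\) trades off compression of the input, measured by \(I(X;Z)\), against preservation of relevant information, measured by \(I(Z;Y)\). In these terms, the equivalence mentioned above states that jointly minimizing the empirical cross-entropy risk and a suitably weighted penalty related to \(I(X;Z)\) amounts to optimizing this objective with the mutual information terms evaluated under the empirical data distribution.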
Shannon famously formulated the fundamental problem of communication as that of reproducing "at one point a message selected at another point." Shannon further argued that the meaning of a message is subjective, i.e., dependent on the observer, and irrelevant to the engineering problem of communication. However, what does matter for the theory of communication is finding suitable representations for given data. In source coding, for example, one generally aims at distilling the relevant information from the data by removing unnecessary redundancies. This can be cast in information-theoretic terms, as higher redundancy makes data more predictable and lowers its information content.

In the context of learning [3,4], we propose to distinguish two rather different aspects of data: information and knowledge. Information contained in data is unpredictable and random, while additional structure and redundancy in the data stream constitutes knowledge about the data-generation process, which a learner must acquire. Indeed, according to connectionist models [5], the redundancy contained within messages enables the brain to build up its cognitive maps, and the statistical regularities in these messages are used for this purpose. Hence, this knowledge, provided by redundancy [6,7] in the data, must be what drives unsupervised learning. While information theory is a unique success story, from its birth it discarded knowledge as irrelevant to the engineering problem of communication. However, knowledge is recognized as a critical, almost central, component of representation learning. The present monograph provides an information-theoretic treatment of this problem.

Knowledge representation. The data deluge of recent decades leads to new expectations for scientific discoveries from massive data. While mankind is drowning in...