Learnability for the Information Bottleneck

Wu, Tong; Fischer, Ian; Chuang, Isaac L.; Tegmark, Max

doi:10.3390/e21100924

Cited by 29 publications

(46 citation statements)

References 19 publications


“…The range of the Lagrange multipliers that allow the exploration of the IB curve is contained by […] which is also contained by […], where […] where […] is the derivative of […] w.r.t. […] evaluated at r, […] is the set of possible realizations of X and […] and […] are defined as in [27] (Note in [27] they consider the dual problem (see Appendix G), so when they refer to […] it translates to β in this article). That is, […].…”

Section: The Convex IB Lagrangian (mentioning)

confidence: 99%

“…Corollaries 2 and 3 allow us to reduce the range search for […] when we want to explore the IB curve. Practically, […] might be difficult to calculate so Wu et al. [27] derived an algorithm to approximate it. However, we still recommend setting the numerator to 1 for simplicity.…”

Section: The Convex IB Lagrangian (mentioning)

confidence: 99%

“…The main difference comes from the discontinuities in performance for increasing […], which cause is still unknown (cf. Wu et al. [27]). It has been observed, however, that the bottleneck variable performs an intrinsic clusterization in classification tasks (see, for instance, [21, 26, 42] or Figure 2b).…”

Section: Experimental Support (mentioning)

confidence: 99%

“…The Lagrange multiplier selection is important since (i) sometimes even choices of […] lead to trivial representations such that […], and (ii) there exist some discontinuities on the performance level w.r.t. the values of […] [27].…”

Section: Introduction (mentioning)

confidence: 99%


The Convex Information Bottleneck Lagrangian

Rodríguez-Gálvez, Thobaben, Skoglund. Entropy, 2020.


The information bottleneck (IB) problem tackles the issue of obtaining relevant compressed representations $T$ of some random variable $X$ for the task of predicting $Y$. It is defined as a constrained optimization problem which maximizes the information the representation has about the task, $I(T;Y)$, while ensuring that a certain level of compression $r$ is achieved (i.e., $I(X;T) \le r$). For practical reasons, the problem is usually solved by maximizing the IB Lagrangian (i.e., $\mathcal{L}_{\mathrm{IB}}(T;\beta) = I(T;Y) - \beta I(X;T)$) for many values of $\beta \in [0,1]$. Then, the curve of maximal $I(T;Y)$ for a given $I(X;T)$ is drawn and a representation with the desired predictability and compression is selected. It is known that when $Y$ is a deterministic function of $X$, the IB curve cannot be explored, and another Lagrangian has been proposed to tackle this problem, the squared IB Lagrangian: $\mathcal{L}_{\mathrm{sq\text{-}IB}}(T;\beta_{\mathrm{sq}}) = I(T;Y) - \beta_{\mathrm{sq}} I(X;T)^2$. In this paper, we (i) present a general family of Lagrangians which allow for the exploration of the IB curve in all scenarios; (ii) provide the exact one-to-one mapping between the Lagrange multiplier and the desired compression rate $r$ for known IB curve shapes; and (iii) show we can approximately obtain a specific compression level with the convex IB Lagrangian for both known and unknown IB curve shapes. This eliminates the burden of solving the optimization problem for many values of the Lagrange multiplier. That is, we prove that we can solve the original constrained problem with a single optimization.

$$F_{\mathrm{IB,max}}(r) = \max_{T \in \Delta} \{ I(T;Y) \} \quad \text{s.t.} \quad I(X;T) \le r, \quad \forall r \in [0, \infty). \tag{1}$$

Definition 2 (IB curve). The IB curve is the set of points defined by the solutions of $F_{\mathrm{IB,max}}(r)$ for varying values of $r \in [0, \infty)$.

Definition 3 (Information plane). The plane is defined by the axes $I(T;Y)$ and $I(X;T)$.

This method has been successfully applied to solve different problems from a variety of domains. For example:

• Supervised learning. In supervised learning, we are presented with a set of n pairs of input features and task output instances. We seek an approximation of the conditional probability distribution between the task
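To make the two Lagrangians in the abstract concrete, here is a minimal sketch for small discrete distributions. It is an illustration, not code from the cited paper: the function names (`mutual_information`, `ib_terms`, `ib_lagrangian`, `squared_ib_lagrangian`), the nested-list representation of the distributions, and the choice of bits (log base 2) are all assumptions made for this example. The encoder p(t|x) is assumed to satisfy the Markov chain T - X - Y.

```python
from math import log2


def mutual_information(joint):
    """Mutual information in bits of a joint distribution joint[a][b]."""
    p_a = [sum(row) for row in joint]          # marginal over rows
    p_b = [sum(col) for col in zip(*joint)]    # marginal over columns
    mi = 0.0
    for a, row in enumerate(joint):
        for b, p in enumerate(row):
            if p > 0:                          # 0 * log(0) is taken as 0
                mi += p * log2(p / (p_a[a] * p_b[b]))
    return mi


def ib_terms(p_xy, p_t_given_x):
    """Return (I(X;T), I(T;Y)) for joint p_xy[x][y] and encoder p_t_given_x[x][t]."""
    n_x, n_y, n_t = len(p_xy), len(p_xy[0]), len(p_t_given_x[0])
    p_x = [sum(row) for row in p_xy]
    # p(x, t) = p(x) p(t|x)
    p_xt = [[p_x[x] * p_t_given_x[x][t] for t in range(n_t)] for x in range(n_x)]
    # p(t, y) = sum_x p(t|x) p(x, y), using the Markov chain T - X - Y
    p_ty = [[sum(p_t_given_x[x][t] * p_xy[x][y] for x in range(n_x))
             for y in range(n_y)] for t in range(n_t)]
    return mutual_information(p_xt), mutual_information(p_ty)


def ib_lagrangian(i_xt, i_ty, beta):
    """IB Lagrangian: I(T;Y) - beta * I(X;T)."""
    return i_ty - beta * i_xt


def squared_ib_lagrangian(i_xt, i_ty, beta_sq):
    """Squared IB Lagrangian: I(T;Y) - beta_sq * I(X;T)^2."""
    return i_ty - beta_sq * i_xt ** 2


if __name__ == "__main__":
    # Deterministic toy task: X is a fair bit and Y = X.
    p_xy = [[0.5, 0.0], [0.0, 0.5]]
    encoders = {
        "identity": [[1.0, 0.0], [0.0, 1.0]],  # T copies X
        "constant": [[1.0, 0.0], [1.0, 0.0]],  # T ignores X (trivial representation)
    }
    for name, enc in encoders.items():
        i_xt, i_ty = ib_terms(p_xy, enc)
        print(name, i_xt, i_ty,
              ib_lagrangian(i_xt, i_ty, beta=0.5),
              squared_ib_lagrangian(i_xt, i_ty, beta_sq=0.5))
```

In this toy case the constant encoder gives I(X;T) = I(T;Y) = 0, so both objectives are zero, while the identity encoder gives one bit for each term. Sweeping `beta` and recording the maximizing encoder's (I(X;T), I(T;Y)) pair is one way to trace points on the information plane; the sketch only evaluates the objectives and does not perform that optimization.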



“…This demonstrates the existence of a critical β for each predictive coding scheme, above which m needs to be increased to extract more predictive information and below which additional values of the representation variable encode redundant portions of allele frequency space. While we do not estimate the critical β, approaches to estimating them are presented in [42,43].…”

Section: Evolutionary Dynamics (mentioning)

confidence: 99%

Optimal prediction with resource constraints using the information bottleneck

Sachdeva, Mora, Walczak, et al. Preprint, 2020.


Responding to stimuli requires that organisms encode information about the external world. Not all parts of the signal are important for behavior, and resource limitations demand that signals be compressed. Prediction of the future input is widely beneficial in many biological systems. We compute the trade-offs between representing the past faithfully and predicting the future for input dynamics with different levels of complexity. For motion prediction, we show that, depending on the parameters in the input dynamics, velocity or position coordinates prove more predictive. We identify the properties of global, transferrable strategies for time-varying stimuli. For non-Markovian dynamics we explore the role of long-term memory of the internal representation. Lastly, we show that prediction in evolutionary population dynamics is linked to clustering allele frequencies into non-overlapping memories, revealing a very different prediction strategy from motion prediction.
