The Information Structuralist

Information and Control in Biology, Part 2: On Bergstrom and Rosvall’s “Transmission Sense of Information”

mraginsky Jan 28, 2022

In the first post of this series, I have outlined the importance of having a proper operational framework in place before starting any talk of information-theoretic quantities. It is, however, a good idea to pin down even provisionally the kind of information-theoretic quantity one hopes to distill out of the operational formulation. This will be … Continue reading "Information and Control in Biology, Part 2: On Bergstrom and Rosvall’s “Transmission Sense of Information”"

Show full content

In the first post of this series, I have outlined the importance of having a proper operational framework in place before starting any talk of information-theoretic quantities. It is, however, a good idea to pin down even provisionally the kind of information-theoretic quantity one hopes to distill out of the operational formulation. This will be the topic of this post, and the starting point will be the paper by Carl Bergstrom and Martin Rosvall on the “transmission sense of information.”

Let me start by quoting the abstract of the paper for the common objections against applying information-theoretic ideas to biology:

Biologists think in terms of information at every level of investigation. Signaling pathways transduce information, cells process information, animal signals convey information. Information flows in ecosystems, information is encoded in the DNA, information is carried by nerve impulses. In some domains the utility of the information concept goes unchallenged: when a brain scientist says that nerves transmit information, nobody balks. But when geneticists or evolutionary biologists use information language in their day-to-day work, a few biologists and many philosophers become anxious about whether this language can be justified as anything more than facile metaphor.

The key claim of Bergstrom and Rosvall is that, by focusing on what they refer to as the transmission sense of information, it is possible to avoid the two pitfalls of mutual information, namely the “shallow” notion of correlation and the so-called parity thesis. In doing so, they emphasize the importance of operational notions in information theory “by taking the viewpoint of a communications engineer and focusing on the decision problem of how information is to be packaged for transport.” While this recognition of the importance of operationalization is not common in biology, I still think that it misses an important point: The relevant decision problem is not that of a communications engineer, it is that of a control engineer, whose primary aim to ensure reliable transmission of the global constraints (formal causes) governing the unfolding of phenotypes (starting with development and continuing on through the organism’s lifetime). In other words, the meaning and use (i.e., the semantics and the pragmatics) of the genome are primary, the symbolic packaging (the syntax) of the genome is secondary. This will be the subject of future posts. Here, my goal is more modest — a critical reconsideration of the Bergstrom and Rosvall paper and an alternative proposal to use directed information, rather than mutual information.

If you are not familiar with directed information, I recommend two old posts of mine (one, two) for the basic background. Briefly, the notion of directed information was proposed by Jim Massey in 1990 in order to analyze systems with feedback by focusing on functional, rather than statistical, dependencies:

… probabilistic dependence is quite distinct from causal dependence. Whether ${X}$ causes ${Y}$ or ${Y}$ causes ${X}$ , the random variables ${X}$ and ${Y}$ will be statistically dependent. Statistical dependence, unlike causality, has no inherent directivity.

This point will be important to us later on.

1. The model

We will consider a very simple model involving the temporal evolution of genotypes and phenotypes in a random environment. We will work in discrete time, using ${t = 1,2,\ldots}$ to index the subsequent generations, so that the triple ${(E_t,G_t,X_t)}$ will consist of the environment, the genotype, and the phenotype, with the obvious temporal ordering:

$\displaystyle E_1,G_1,X_1,E_2,G_2,X_2,\ldots,E_t,G_t,X_t,\ldots .$

Let us consider what happens in ${T}$ generations. To that end, we need to prescribe a joint distribution for ${E^T,G^T,X^T}$ :

$\displaystyle P_{E^T,G^T,X^T}(e^T,g^T,x^T) = \prod^T_{t=1} P_{E_t|E^{t-1}}(e_t|e^{t-1}) P_{G_t|G_{t-1},E_{t-1}}(g_t|g_{t-1},e_{t-1})P_{X_t|G_t,E_t}(x_t|g_t,e_t). \ \ \ \ \ (1)$

The factorization of the joint distribution in (1) encodes some important assumptions, which we need to spell out:

The genotype and the environment are modeled by two discrete-time stochastic processes. Notice that the environment is not necessarily Markovian, but the genome process is a Markov process in a random environment (i.e., at each time ${t}$ , the genome ${G_t}$ is conditionally independent of ${G^{t-2}}$ and ${E^{t-2}}$ given ${G_{t-1}}$ and ${E_{t-1}}$ ). This is a very simple model of asexual reproduction, and the process distribution of ${G^T}$ models mutations (through stochastic dependence of ${G_t}$ on ${G_{t-1}}$ ) and selection (via dependence of ${G_t}$ on ${E_{t-1}}$ ).
The phenotype at ${t}$ is determined by the genotype and by the environment at ${t}$ , but there is no causal influence of the phenotype on the genome (the Central Dogma of molecular biology: information cannot flow from proteins to genes).

Again, I need to emphasize that this is a very simple model (see, e.g., the work of Rivoire and Leibler) that does not account for various feedback mechanisms by which the organisms can act on their environments, such as niche construction.

2. The lack of directivity and the parity thesis

Bergstrom and Rosvall consider what happens at each ${t}$ (they call this the horizontal dimension) as well as the entire trajectory (they call this the vertical dimension); their key claim is that the “transmission sense of information” emerges only when we focus on the latter. Their argument runs as follows. Let us assume for the most part, as they do, that there is no direct feedback from the environment to the genome, so ${P_{G_t|G_{t-1},E_{t-1}} = P_{G_t|G_{t-1}}}$ for each ${t}$ . Consider first the mutual information ${I(G_t;X_t)}$ ; note that, since the genome of any organism has finitely many base pairs, this mutual information is always well-defined and finite. They bring up two objections to meaningful use of this quantity in biology:

Mutual information is a “shallow” notion of statistical correlation (in the words of Massey, it has no inherent directivity). Indeed, since the mutual information is symmetric, we have ${I(G_t;X_t) = I(X_t;G_t)}$ , and this symmetry runs counter to the central dogma (no flow of information from proteins to genes). In fact, we can also include the environment and consider ${I(G_t; X_t,E_t)}$ , which runs into the same issues. There is also the related objection that, whatever information there is in the genes, it is semantic in some sense, whereas the semantic aspects are notably absent from Shannon’s theory.
Once we bring the environment into the picture, we also have to deal with the so-called “parity thesis:” In the terminology of Bayes networks, the random triple ${(E_t,G_t,X_t)}$ forms a collider, i.e., ${G_t}$ and ${E_t}$ are mutually independent, and ${X_t}$ is conditionally dependent on both of them — ${P_{E_t,G_t,X_t}(e_t,g_t,x_t) = P_{E_t}(e_t)P_{G_t}(g_t)P_{X_t|E_t,G_t}(x_t|e_t,g_t)}$ . As a DAG, this Bayes network is symmetric with respect to ${E_t}$ and ${G_t}$ ; as a result, the mutual information fails to single out the DNA (genome) as the information-bearing entity. If we think of the environment as the source of developmental noise, then the picture that emerges fails to distinguish the “control instructions” (the genome) from the “actuation disturbance” (the developmental noise).

I agree with Bergstrom and Rosvall that these objections provide a strong argument against using mutual information in this context. However, this does not mean that there is no alternative information-theoretic construct that is free of the drawbacks of the mutual information. I claim that one has to use directed information instead. This should not be surprising, since we are in fact trying to capture functional, rather than statistical, dependencies. So, let us then compute the directed informations ${I(G_t \rightarrow (E_t,X_t))}$ and ${I((E_t,X_t) \rightarrow G_t)}$ , where the former will quantify the causal influence of the genome on the environment and on the phenotype, whereas the latter will quantify the causal influence of the environment and the phenotype on the genome. In order to do that, we need to compute the directed stochastic kernels ${\vec{P}_{G_t|E_t,X_t}}$ and ${\vec{P}_{E_t,X_t|G_t}}$ . These can be read off directly from the joint distribution

$\displaystyle P_{E_t,G_t,X_t}(e_t,g_t,x_t) = P_{E_t}(e_t)P_{G_t}(g_t) P_{X_t|G_t,E_t}(x_t|g_t,e_t)$

and are given by

$\displaystyle \vec{P}_{G_t|E_t,X_t}(g_t|e_t,x_t) = P_{G_t}(g_t) \qquad \text{and} \qquad \vec{P}_{E_t,X_t|G_t}(e_t,x_t|g_t) = P_{E_t}(e_t)P_{X_t|G_t,E_t}(x_t|g_t,e_t).$

With these, we can compute the corresponding directed informations:

$\displaystyle \begin{array}{rcl} I(G_t \rightarrow (E_t,X_t)) &=& D(P_{G_t,E_t,X_t}\|\vec{P}_{G_t|E_t,X_t} \times P_{E_t,X_t}) \\ &=& {\mathbf E}\left[\log \frac{P_{G_t,E_t,X_t}}{\vec{P}_{G_t|E_t,X_t} P_{E_t,X_t}}\right] \\ &=& {\mathbf E}\left[\log \frac{P_{G_t,E_t,X_t}}{P_{G_t} P_{E_t,X_t}} \right] \\ &=& I(G_t; E_t,X_t), \end{array}$

so this equals just the mutual information between ${G_t}$ and ${(E_t,X_t)}$ , whereas

$\displaystyle \begin{array}{rcl} I((E_t,X_t) \rightarrow G_t) &=& D(P_{E_t,X_t,G_t} \| \vec{P}_{E_t,X_t|G_t} \times P_{G_t}) \\ &=& {\mathbf E}\left[\log \frac{P_{E_t,X_t,G_t}}{\vec{P}_{E_t,X_t|G_t} P_{G_t}} \right] \\ &=& {\mathbf E}\left[\log \frac{P_{E_t,X_t,G_t}}{P_{E_t}P_{X_t|G_t,E_t} P_{G_t}} \right] \\ &=& {\mathbf E}\left[\log \frac{P_{E_t,X_t,G_t}}{P_{E_t,X_t,G_t}} \right] \\ &\equiv& 0. \end{array}$

Since directed information quantifies causal (or functional), rather than statistical, dependencies, this is to be expected — the genes together with the environment exert causal influence on the phenotype, but the phenotype and the environment cannot exert any causal influence on the genes. Here it may also be useful to recall the conservation law that relates mutual information and directed information:

$\displaystyle I(G_t; E_t,X_t) = I(G_t \rightarrow (E_t,X_t)) + I((E_t,X_t) \rightarrow G_t),$

and, since the second term on the right-hand side is identically zero, in this particular instance the mutual information between ${G_t}$ and ${(E_t,X_t)}$ captures all of the causal influence of the genes on the phenotype. This switch of perspective from the mutual information to the directed information dispenses with the shallowness objection.

The “parity thesis” is more interesting, and it is indeed a subtle issue. It is easy to see that we can interchange the roles of ${G_t}$ and ${E_t}$ in the above model and obtain the relations

$\displaystyle I(E_t \rightarrow (G_t,X_t)) = I(E_t; G_t,X_t) \qquad \text{and} \qquad I((G_t,X_t) \rightarrow E_t) = 0.$

This is, as I had mentioned earlier, a simple consequence of the collider structure ${G_t \rightarrow X_t \leftarrow E_t}$ , where ${G_t}$ and ${E_t}$ are two statistically independent objects that jointly influence ${X_t}$ . Bergstrom and Rosvall correctly point out that this is, in fact, a general statement about the interchangeability of the source and the channel noise, as far as the matters of correlation go. Indeed, we can consider a triple of random variables ${(U,W,X)}$ , where ${U}$ is a random source that is being transmitted over a channel, whose output ${X}$ is a noise-corrupted version of ${U}$ , i.e., ${X = f(U,W)}$ for some deterministic function ${f}$ , and ${W}$ is the channel noise which is statistically independent of the source ${U}$ . Again, we have the collider DAG ${U \rightarrow X \leftarrow W}$ , which is symmetric with respect to the roles of ${U}$ and ${W}$ . And indeed, just like in the case with genes, environment, and proteins, there is nothing in the topology of the DAG that privileges the source ${U}$ over the noise ${W}$ . There is no way to circumvent the parity thesis other than simply acknowledging, following Howard Pattee, that everything hinges on making the correct epistemic cut between the system (the interconnection of the controller and of the object of control) and the system’s environment. Indeed, if we recall the instructional or the control role of the genome, then the resolution of the parity thesis should come from the realization that the control law that governs the production of proteins should be (and, in fact, is) a great deal more sensitive to the control instruction itself than to the developmental noise. In other words, it is not enough to look at the mutual information or some such information theoretic quantity; in order to break the symmetry inherent in the parity thesis, we have to look at the relative sensitivity of the conditional distribution ${P_{X_t|G_t = g_t,E_t = e_t}}$ to modifications in ${G_t}$ versus those in ${E_t}$ . Just as a well-designed control policy should be more sensitive to the control input than to random disturbances, we expect the synthesis of proteins to be more sensitive to the instructions in the DNA than to the disturbances coming from the cellular milieu.

Remember, by the way, that in all of the above I have assumed that there is no dependence between the genome and the environment (i.e., genes may undergo mutation, but there is no selection). We can include selection as well, but the resulting mathematics will be a lot more complicated; in particular, we will need to consider so-called causally conditioned directed information, when we condition on the past realizations of ${G}$ and ${E}$ . The main point here is that ${G_t}$ and ${E_t}$ are conditionally independent given ${G^{t-1}}$ and ${E^{t-1}}$ .

3. The vertical dimension and the transmission sense

Again, let’s follow Bergstrom and Rosvall and simplify things for now by assuming no selection, so ${P_{G_t|G_{t-1},E_{t-1}} = P_{G_t|G_{t-1}}}$ for each ${t}$ . At this point, hopefully I have convinced you that we need to look at directed, rather than mutual, information. Now, however, we will work with directed information between processes, i.e., from ${G^T}$ to ${(E^T,X^T)}$ , as well as the one from ${(X^T,E^T)}$ to ${G^T}$ . The relevant directed stochastic kernels are given by

$\displaystyle \begin{array}{rcl} \vec{P}_{E^T|G^T,X^T}(e^T|g^T,x^T) &=& \prod^T_{t=1} P_{E_t|E^{t-1}}(e_t|e^{t-1}) \equiv P_{E^T}(e^T), \\ \vec{P}_{G^T|E^T,X^T}(g^T|e^T,x^T) &=& \prod^T_{t=1} P_{G_t|G_{t-1}}(g_t|g_{t-1}) \equiv P_{G^T}(g^T), \\ \vec{P}_{X^T|G^T,E^T}(x^T|g^T,e^T) &=& \prod^T_{t=1} P_{X_t|G_t,E_t}(x_t|g_t,e_t) \equiv P_{X^T|G^T,E^T}(x^T|g^T,e^T) \end{array}$

Thus, we can compute the directed information

$\displaystyle \begin{array}{rcl} I(G^T \rightarrow (E^T,X^T)) &=& {\mathbf E}\left[\log \frac{P_{G^T,E^T,X^T}}{\vec{P}_{G^T|E^T,X^T} \times P_{E^T,X^T}}\right] \\ &=& {\mathbf E}\left[\log \frac{P_{G^T,E^T,X^T}}{P_{G^T}P_{E^T,X^T}}\right] \\ &=& I(G^T; E^T,X^T), \end{array}$

and, once again, ${I((E^T,X^T) \rightarrow G^T) = 0}$ , so there is no causal influence from ${E^T,X^T}$ to ${G^T}$ . This situation is completely analogous to the one when ${t}$ was fixed. In fact, an entirely analogous argument to the one used for each ${t}$ shows that we can interchange the roles of ${G^T}$ and ${E^T}$ , so that

$\displaystyle I(E^T \rightarrow (G^T,X^T)) = I(E^T; G^T,X^T) \quad \text{and} \quad I((G^T,X^T) \rightarrow E^T) = 0$

so focusing on the temporal unfolding of genes, proteins, and environments does not rid us of the parity thesis. Since both genes and environments affect the phenotype, this is not surprising at all.

Bergstrom and Rosvall’s resolution is to cut through all of this by focusing on just the genome process ${G^T}$ . This is what they mean by the transmission sense of information from the viewpoint of a fictitious communications engineer, the role they ascribe to natural selection. Since in their basic model there is no feedback from the phenotype to the environment, we may as well focus on just ${G^T}$ and ${E^T}$ . It is easy to marginalize ${X^T}$ away in (1) to obtain the following factorization of the joint probability law of ${G^T}$ and ${E^T}$ :

$\displaystyle P_{G^T,E^T}(g^T,e^T) = \prod^T_{t=1} P_{E_t|E^{t-1}}(e_t|e_{t-1}) P_{G_t|G_{t-1},E_{t-1}}(g_t|g_{t-1},e_{t-1}). \ \ \ \ \ (2)$

From the viewpoint of a communications engineer, Equation (2) does indeed describe a communication channel: The environment ${E^T}$ is a fixed “noise process” that plays some part in the evolution of ${G^T}$ . At each time ${t}$ , the current “output” genome ${G_t}$ is determined stochastically by the previous “input” genome ${G_{t-1}}$ and by the environmental noise ${E_{t-1}}$ . If we further marginalize out the environment ${E^T}$ , we will obtain the process law of ${G^T}$ . The central claim of Bergstrom and Rosvall is that the transmission sense of biological information is reflected in the structural and the distributional properties of this process law, especially if one were to compare it to a “high-variety” random object like the environmental noise process ${E^T}$ . The main idea is that the combinatorial structure of the genomic alphabet and the information-theoretic characteristics of the proobability law of ${G^T}$ point to the genome’s function, which is to reduce uncertainty.

Now, uncertainty about what? In order to answer this question, we need to pose an operational interpretation, an optimization problem that our fictitious engineer must solve. And, as I have stressed before, the value of this optimization problem cannot be stated a priori in information-theoretic terms; any such quantity has to be distilled from the nature of the original problem and from the interplay of the source, the channel, and the overall goals. Bergstrom and Rosvall do not give such a formulation, but it is possible to come up with one — see, e.g., the works by Rivoire and Leibler given in the references (I may write a blog post on their work in the future).

I want to close with a critical look at the following quote from the Bergstrom and Rosvall paper:

When information theorists think about coding, they are not thinking about semantic properties. All of the semantic properties are stuffed into the codebook, the interface between source structure and channel structure, which to information theorists is as interesting as a phonebook is to sociologists. When an information theorist says “Tell me how data stream A codes for message set B,” she is not asking you to read her the codebook. She is asking you to tell her about compression, channel capacity, distortion structure, redundancy, and so forth. That is, she wants to know how the structure of the code reflects the statistical properties of the data source and the channel with respect to the decision problem of effectively packaging information for transport.

While the last part is correct (about the structure of the code reflecting the statistics of the source and the channel with respect to the underlying operational criterion), the opening claim about semantics is not entirely accurate. I have touched upon some of these issues in a Twitter thread:

In this thread, I will argue that the conventional wisdom that Shannon’s information theory is all about syntax and not about semantics stems from superficial reading. On the contrary, even his 1948 BSTJ paper is already concerned with syntax, semantics, *and* pragmatics. 1/14

— Maxim Raginsky (@mraginsky) January 29, 2021

I will probably expand that thread into a blog post soon, but here I just want to emphasize the basic nature of the decision problem faced by a communications engineer. The goal is to convey some information about a source ${S}$ through a given noisy channel. The relevant random objects are then ${S}$ (the source itself), ${W}$ (the discrete message containing a suitably compressed summary of the source, ${X}$ (the input to the noisy channel), ${Y}$ (the output of the noisy channel), and, finally, ${A}$ (the action taken by the receiver on the basis of ${Y}$ ). The joint probability law of all these random objects factorizes as

$\displaystyle P_{SWXYA} = P_S P_{W|S} P_{X|W} P_{Y|X} P_{A|Y}$

(and here I am simplifying quite a bit and not worrying about time structure, feedback, etc). Of the stochastic kernels making up this factorization, the two things that are not under the control of the system designer are the source probability law ${P_S}$ and the channel transition law ${P_{Y|X}}$ ; everything else is a decision variable — the source code ${P_{|W|S}}$ , the channel code ${P_{X|W}}$ , and the receiver’s action policy ${P_{A|Y}}$ . The overall optimization problem would then be to choose these ingredients to minimize, say, the expected cost ${{\mathbf E}[c(S,A)]}$ , possibly subject to additional constraints, such as input costs for the channel. In some sense, the goal of the system designer is to choose the missing stochastic kernels for an incomplete specification given by just ${P_S}$ and ${P_{Y|X}}$ , and some of these stochastic kernels do have a bearing on the semantics and on the pragmatics of the problem: both the source ${S}$ and the action ${A}$ reflect material events, while all the other objects, i.e., ${W,X,Y}$ , can be simply thought of as symbols that can be manipulated purely syntacticallly, without much thought given to their meaning (i.e., how they relate to ${S}$ and to ${A}$ ) or to their use (why one would want to take an action pertaining to the source in the first place). While the problem of designing channel codes can be plausibly thought of as free of semantics, both the source code and the action policy are interfaces between matter and symbols, and are thus necessarily semantic (cf. Howard Pattee’s work on this).

It seems that the transmission sense of Bergstrom and Rosvall could pertain to the structure of the channel input and output alphabets and to the distributional properties of ${X}$ once all the stochastic kernels have been specified. Indeed, in some settings it is possible to argue that the distribution of ${X}$ should be as close as possible to capacity-achieving for the channel ${X \rightarrow Y}$ , i.e., it should maximize the mutual information ${I(X;Y)}$ , and therefore depend only on the channel transition law ${P_{Y|X}}$ and not on any semantic or pragmatic aspects of the overall problem. This, however, will only be true under very special conditions (in particular, we would need to ensure that the source-channel separation principle is valid), and at any rate it is not clear to me that the relevant setting is communication and not control (orchestration of development and growth via formal causes).

References

Carl Bergstrom and Martin Rosvall, “The transmission sense of information”
James Massey, “Causality, feedback, and directed information”
Robert Cogburn, “Markov Chains in random environments: The case of Markovian environments”
Olivier Rivoire and Stanislas Leibler, “The value of information for populations in random environments”
Olivier Rivoire, “Information in models of evolutionary dynamics”
Howard Pattee, “The physics of symbols: bridging the epistemic cut”
Sridhar Vembu, Sergio Verdú, and Yossef Steinberg, “The source-channel separation theorem revisited”
Gérard Battail, “Error-correcting codes and information in biology”

http://infostructuralist.wordpress.com/?p=769

Extensions

Information and Control in Biology, Part 1: Preliminary Considerations

mraginsky Jan 11, 2022

Disclaimer: I am not a biologist, but I have become interested in biology and related matters over the past couple of years. One reason is obviously the pandemic, so the talk of biology, viruses, mRNA, and the like is everywhere. The other, main, reason is that I think we will not get anywhere interesting in … Continue reading "Information and Control in Biology, Part 1: Preliminary Considerations"

Show full content

Disclaimer: I am not a biologist, but I have become interested in biology and related matters over the past couple of years. One reason is obviously the pandemic, so the talk of biology, viruses, mRNA, and the like is everywhere. The other, main, reason is that I think we will not get anywhere interesting in AI unless we understand the concepts of autonomy, self-directedness, integration, and adaptation in even very simple biological systems.

This will be the first in a series of posts that are meant as an extended response to Yohan John‘s old post over at 3 Quarks Daily.

Yohan writes:

We are increasingly employing information as an explanation of phenomena outside the world of culture and technology — as the central metaphor with which to talk about the nature of life and mind. Molecular biology, for instance, tells us how genetic information is transferred from one generation to the next, and from one cell to the next. And neuroscience is trying to tell us how information from the external world and the body percolates through the brain, influencing behavior and giving rise to conscious experience.

But do we really know what information is in the first place? And is it really a helpful way to think about biological phenomena? I’d like to argue that explanations of natural phenomena that involve information make inappropriate use of our latent, unexamined intuitions about inter-personal communication, blurring the line between what we understand and what we don’t quite have a grip on yet.

Similar sentiments are quoted by Carl Bergstrom and Martin Rosvall:

Biologists think in terms of information at every level of investigation. Signaling pathways transduce information, cells process information, animal signals convey information. Information flows in ecosystems, information is encoded in the DNA, information is carried by nerve impulses. In some domains the utility of the information concept goes unchallenged: when a brain scientist says that nerves transmit information, nobody balks. But when geneticists or evolutionary biologists use information language in their day-to-day work, a few biologists and many philosophers become anxious about whether this language can be justified as anything more than facile metaphor.

Yohan argues that information theory is, on the whole, not an appropriate framework with which to reason about biological information. Carl and Martin argue otherwise, but propose their own framework, what they refer to as the transmission sense of information, which purportedly resolves the issues that trouble “a few biologists and many philosophers.” My goal in this series of posts is to argue that information theory can indeed be applied to biology, but that its proper application needs to be built up from first principles, starting with a serious engagement with its entire conceptual framework. Moreover, I agree with Yohan that digital communication is not the right conceptual schema; instead, we should be talking about control, programmability, and behaviors.

1. Operationalization has to come first

In one big way, Yohan is right — one has to search far and wide for applications of information theory to biology that rise above the level of facile metaphors. What he does not talk about is the reason, which is essentially the same reason why many attempts to apply of information theory outside its original domain of communication systems largely fail. This has to do with the fact that, in information theory, the starting point is always an operational formulation in terms of various performance criteria that are not phrased in information-theoretic language. For instance, if we talk about data compression, then we bring in the considerations of code length, probability of error, compression ratio, etc.; if we talk about channel coding, then we are interested in the maximal rate of transmission subject to a given level of bit or block errors; if we talk about source coding, then we care about expected distortion, codebook size, etc. Note that none of these criteria are phrased in terms of information-theoretic quantities like entropy, conditional entropy, or mutual information. This is not accidental, and the whole enterprise of information theory is to express the fundamental limits of communication systems quantitatively in terms of these quantities. In other words, the pragmatics of communication comes first, the information-theoretic analysis comes second.

When it comes to applying information theory to biology, this important point somehow gets lost in translation, although some biologists do emphasize it — see, e.g., this paper by Olivier Rivoire. Bergstrom and Rosvall recognize this as well, but, in my opinion, do not explore the implications deeply enough. So, if we do want to get anywhere, we have to start with an operational formulation. In the context of molecular and evolutionary biology, which is primarily what I am interested in here, we will need to formulate our operational criteria at the level of an organism as the subject and object of evolution.

2. Genotypes as behaviors, phenotypes as constrained trajectories

The temptation to apply information theory to molecular and evolutionary biology is understandable: The term “genetic code” all but forces one to think in terms of information in the sense of Shannon! However, before we begin, we need to question the communication framework itself. Even granting the language of codes, we need to identify the problem to which the genetic code is a solution. Bergstrom and Rosvall argue that the problem is one of reliable and maximally efficient transmission of the genome from one generation to the next. However, if that were the only problem, this does not answer C.H. Waddington’s question “why animals should have evolved all sorts of highly adaptive structures to do unlikely things instead of simply being reduced to bags of eggs and sperm like certain parasitic worms.” In other words, we need to discuss the genotype and the phenotype. Once we bring the phenotype into the picture, the communication metaphor dissolves in favor of one of control and homeorhesis. In this interpretation, again due to Waddington, the genome is not just a message to be transmitted through time and space, it is a set of instructions (or a program) which, together with some aspects of the organism’s environment and developmental noise, will be used to construct (or realize) the organism’s phenotype out of the set of potential phenotypes. Indeed, biological organisms are highly constrained, programmable open systems which, throughout the course of their development and life, maintain certain relations with their environment. This includes not only the environment affecting the organism, but also the organism affecting and constraining its environment (niche construction).

This suggests thinking about genomes as specifying a behavior in the sense of Jan Willems. To get the main point across, it suffices to paint a very broad picture. Let ${g}$ denote the genome, ${x}$ the configuration of the organism, ${e}$ the configuration of the environment, and ${w}$ the developmental noise. Then the role of the genome is to act as the organism’s formal cause, i.e., to constrain the relation between all the other variables. In other words, treating everything except the genome as a trajectory in time, we can formalize the relations between the organism and its environment as a certain subset ${{\mathcal B}_g}$ of the product space

$\displaystyle X^T \times E^T \times W^T,$

where, e.g., ${X^T}$ denotes the set of all trajectories ${(x_t)_{t \in T}}$ through the configuration space of the organism over its lifetime (here I am, of course, drastically oversimplifying things, ignoring, for instance, the fact that the lifetime of the organism is generally determined by its trajectory and its interaction with the environment). To be a bit more concrete, we may say that a tuple of trajectories ${(x^T,e^T,w^T)}$ is in ${{\mathcal B}_g}$ if there exist suitably defined functions ${f_1}$ , ${f_2}$ , and ${f_3}$ , such that

$\displaystyle x^T = f_1(g,x^T,e^T,w^T)$

(i.e., the configuration of the organism is determined by its genome, by its environment, and by developmental noise, possibly with feedback from the organism, subject to appropriate causality restrictions) and

$\displaystyle f_2(e^T) = f_3(x^T,e^T)$

(i.e., some aspects of the environment are shaped by the organism by interaction or coupling via the phenotype but not directly through the genotype). Then the phenotype is any trajectory ${x^T}$ that is compatible with these constraints, i.e., which could arise in this fashion for some realization of the environment ${e^T}$ and the developmental noise ${w^T}$ . Notice that this description introduces mutual constraints between the organism and its environment. So, if one were to think of the genome as a message to be transmitted in space and in time, then this message is a program (formal cause) of the organism, but it does not (indeed, cannot!) contain all the information about the phenotype. The missing information is contained in the environment and in the developmental noise.

Note, by the way, that there is nothing specifically biological in the above description. Indeed, all this time I could have been talking about a Turing machine, where ${g}$ describes its state transition and output functions, ${x}$ is its internal state, ${e}$ is the content of its tape, and ${w}$ is any randomness in the state transitions. In order to bring the biology back into the picture, we could follow Robert Rosen‘s characterization of biological organisms in terms of closure with respect to efficient causation, i.e., the organism’s internal ability to maintain certain relations and constraints with its environment (which includes itself). However, this illustration, I think, provides a correction to the naive picture of the genome “computing” the organism just like a Turing machine computing a function, rightly criticized by Yohan John. A better picture is the one above: the genome prescribes the operation of the Turing machine, the environment provides the initial input and records the output of the Turing machine, and the state sequence generated epigenetically for that particular environment is the phenotype.

This also addresses, I believe, Waddington’s objection to the use of information theory in biology on the grounds of the data processing inequality:

It seems quite obvious to common sense that a rabbit running around in a field contains a much greater `amount of variety’ than a newly fertilized rabbit’s egg. How are we to deal with the situation in terms of an `information theory’ whose basic tenet is that information cannot be gained?

The amount of information contained in the phenotype is, of course, finite, but it only acts as the formal cause that determines the entire behavior ${{\cal B}_g}$ ; it cannot by itself select a particular trajectory in the behavior. Hence, if we wish to operationalize the problem of reliable transmission of genetic information, we need to phrase it in terms of criteria pertainig to preserving the behavior as a whole. I will take this up, in a simplified form, in subsequent posts; the next post, however, will mostly be about the Bergstrom-Rosvall paper on the transmission sense of information.

References

Yohan John, “How informative is the concept of biological information?”
Carl Bergstrom and Martin Rosvall, “The transmission sense of information”
Olivier Rivoire, “Information in models of evolutionary dynamics”
Richard Lewontin, “The organism as the subject and object of evolution”
Conrad H. Waddington, “The basic ideas of biology”
Yuuki Matsushita and Kunihiko Kaneko, “Homeorhesis in Waddington’s landscape by epigenetic feedback regulation”
Jan Willems, “The behavioral approach to open and interconnected systems”
Robert Rosen, Life Itself: A Comprehensive Inquiry into the Nature, Origin and Fabrication of Life

http://infostructuralist.wordpress.com/?p=758

Extensions

Sampling Using Diffusion Processes, from Langevin to Schrödinger

mraginsky Sep 2, 2021

These notes are based on the tutorial I gave at the Geometric Methods in Optimization and Sampling Boot Camp at the Simons Institute in Berkeley. Suppose we wish to obtain samples from some probability measure on . If has a sufficiently well-behaved density with respect to the Lebesgue measure, i.e., , then we can use … Continue reading "Sampling Using Diffusion Processes, from Langevin to Schrödinger"

Show full content

These notes are based on the tutorial I gave at the Geometric Methods in Optimization and Sampling Boot Camp at the Simons Institute in Berkeley.

Suppose we wish to obtain samples from some probability measure ${\mu}$ on ${{\mathbb R}^d}$ . If ${\mu}$ has a sufficiently well-behaved density ${f}$ with respect to the Lebesgue measure, i.e., ${\mu(dx) = f(x) dx}$ , then we can use the (overdamped) continuous-time Langevin dynamics, governed by the Ito stochastic differential equation (SDE)

$\displaystyle d X_t = \frac{1}{2}\nabla \log f(X_t) dt + dW_t, \qquad t \ge 0 \ \ \ \ \ (1)$

where the initial condition ${X_0}$ is generated according to some probability law ${\mu_0}$ , and ${(W_t)_{t \ge 0}}$ is the standard ${d}$ -dimensional Brownian motion. Let ${\mu_t}$ denote the probability law of ${X_t}$ . Then, under appropriate regularity conditions on ${f}$ , one can establish the following:

${\mu}$ is the unique invariant distribution of (1), i.e., if ${\mu_0 = \mu}$ , then ${\mu_t = \mu}$ for all ${t}$ .
${\mu_t}$ converges to ${\mu}$ in a suitable sense as ${t \rightarrow \infty}$ — in fact, it is often possible to show that there exists a constant ${c > 0}$ that depends only on ${\mu}$ , such that one has the exponential convergence to equilibrium
$\displaystyle {\rm dist}(\mu_t, \mu) \le e^{-t/c}{\rm dist}(\mu_0, \mu)$

for some distance between probability measures on ${{\mathbb R}^d}$ .

In this sense, the Langevin process (1) gives only approximate samples from ${\mu}$ . I would like to discuss an alternative approach that uses diffusion processes to obtain exact samples in finite time. This approach is based on ideas that appeared in two papers from the 1930s by Erwin Schrödinger in the context of physics, and is now referred to as the Schrödinger bridge problem.

We will now assume that the target probability measure ${\mu}$ has a density ${f}$ with respect to the canonical Gaussian measure on ${{\mathbb R}^d}$ , i.e.,

$\displaystyle \mu(dx) = f(x)\gamma(dx), \qquad \gamma(dx) := \frac{1}{(2\pi)^{d/2}} e^{-|x|^2/2}dx.$

We will also assume that ${f}$ is everywhere strictly positive. Our goal is to construct an Ito process governed by the SDE

$\displaystyle dX_t = b(X_t,t)dt + dW_t, \qquad t \in [0,1] \ \ \ \ \ (2)$

starting at ${X_0 = 0}$ , such that:

${X_1}$ has probability law ${\mu}$ ;
${Q}$ , the probability law of the process ${(X_t)_{t \in [0,1]}}$ , is optimal in the sense that, among all the diffusion processes with values in ${{\mathbb R}^d}$ starting at the origin at ${t=0}$ and having the marginal distribution ${\mu}$ at ${t=1}$ , it remains “as close as possible” to the standard Brownian motion in the sense to be made precise below.

Note that the drift in (2) is time-dependent. This problem has been studied from many angles, and the optimal drift has an explicit form. My goal here is to give what is, to the best of my knowledge, a somewhat new and rather transparent derivation of the solution to the Schrödinger bridge problem. The main idea is based on a rather neat trick that I picked up from a paper by Hiroshi Ezawa, John Klauder, and Larry Shepp (although it was already present in some form in an earlier short paper by Václav Beneš and Shepp). The presentation will be rather informal, but every step can be made rigorous.

1. The Schrödinger bridge problem

The basic problem can be stated as follows. Let ${\Omega}$ denote the space of continuous functions (paths) from ${[0,1]}$ into ${{\mathbb R}^d}$ , and let ${W = (W_t)_{t \in [0,1]}}$ denote the canonical coordinate process on ${\Omega}$ : For any ${\omega \in \Omega}$ , ${W_t(\omega) := \omega(t)}$ . Let ${P}$ denote the Wiener measure on ${\Omega}$ , under which ${W}$ is the standard ${d}$ -dimensional Brownian motion on ${[0,1]}$ . Given any probability measure ${Q}$ on ${\Omega}$ , we will denote by ${Q_t}$ the marginal probability law of the value of the path at time ${t}$ . With this, we consider the set

$\displaystyle {\cal M}^\mu := \left\{ Q : Q_0 = \delta_0 \text{ and } Q_1 = \mu \right\},$

where ${\delta_0}$ denotes the Dirac probability measure on ${{\mathbb R}^d}$ centered at the origin. Our goal is to find

$\displaystyle Q^\mu = \mathop{\text{arg\,min}}_{Q \in {\cal M}^\mu} D(Q \| P),$

where ${D(\cdot \| \cdot)}$ denotes the relative entropy between two probability measures on ${\Omega}$ . We will need to do two things: (a) show that ${Q^\mu}$ exists and is unique, and (b) show that it is the probability law of a diffusion process governed by an Ito SDE of the form (2).

To take care of (a), all we need is the chain rule for the relative entropy. Fix an arbitrary ${Q \in {\cal M}^\mu}$ ; without loss of generality, we can assume that ${Q}$ is absolutely continuous with respect to the Wiener measure ${P}$ , i.e., ${D(Q \| P ) < \infty}$ . The chain rule then gives

$\displaystyle D(Q \| P) = D(Q_1 \| P_1) + \int_{{\mathbb R}^d} P_1(dx) D(Q^x \| P^x), \ \ \ \ \ (3)$

where ${Q^x}$ (respectively, ${P^x}$ ) denotes the conditional probability law of ${X = (X_t)_{t \in [0,1]}}$ given ${X_1 = x}$ under ${Q}$ (respectively, ${P}$ ). The conditional probability measure ${P^x}$ is well-known, and gives the probability law of the Brownian bridge process, i.e., the standard Brownian motion pinned to the point ${x}$ at ${t=1}$ . Now, since ${Q \in {\cal M}^\mu}$ , we have ${Q_1 = \mu}$ , whereas ${P_1 = \gamma}$ for the Wiener measure ${P}$ . Thus, we have the following for the right-hand side of (3):

$\displaystyle \begin{array}{rcl} D(Q_1 \| P_1) + \int_{{\mathbb R}^d} P_1(dx) D(Q^x \| P^x) &=& D(\mu \| \gamma) + \int_{{\mathbb R}^d}\gamma(dx) D(Q^x \| P^x) \\ &\ge& D(\mu \| \gamma), \end{array}$

where equality holds if and only if ${Q^x = P^x}$ almost everywhere. This immediately implies two things:

${D(\mu \| \gamma) = \inf_{Q \in {\cal M}^\mu} D(Q \| P)}$ ;
the above infimum is attained by the probability measure
$\displaystyle Q^\mu(\cdot) = \int_{{\mathbb R}^d} \mu(dx) P^x(\cdot), \ \ \ \ \ (4)$

i.e., by the ${\mu}$ -mixture of the Brownian bridge processes ${P^x}$ .

From (4), it is easy to obtain an explicit expression for the Radon–Nikodym derivative ${\frac{d Q^\mu}{d P}}$ : Let ${F}$ be an arbitrary bounded measurable real-valued function on ${\Omega}$ . Then, using the fact that ${d\mu = fd\gamma}$ , (4), and the law of iterated expectation, we can write

$\displaystyle \begin{array}{rcl} {\mathbb E}_{Q^\mu}[F(X)] &=& {\mathbb E}_\mu[{\mathbb E}_{Q^\mu}[F(X)|X_1]] \\ &=& {\mathbb E}_\mu[{\mathbb E}_P[F(X)|X_1]] \\ &=& {\mathbb E}_\gamma[f(X_1){\mathbb E}_P[F(X)|X_1]] \\ &=& {\mathbb E}_\gamma[E_P[f(X_1)F(X)|X_1]] \\ &=& {\mathbb E}_P[f(W_1)F(W)], \end{array}$

where the first equation is due to the fact that ${Q^\mu_1 = \mu}$ , in the second equation we have used the fact that the conditional laws ${Q^\mu[\cdot|X_1 = x]}$ and ${P[\cdot|X_1 = x]}$ coincide, and in the last equation ${W = (W_t)_{t \in [0,1]}}$ denotes the standard Brownian motion process with process law ${P}$ . That is, the Radon–Nikodym derivative of ${Q^\mu}$ with respect to ${P}$ , as a function of the path ${W}$ , depends only on the terminal point ${W_1}$ and takes the form

$\displaystyle \frac{dQ^\mu}{dP}(W) = f(W_1), \ \ \ \ \ (5)$

where, as we recall, ${f}$ denotes the density of the target measure ${\mu}$ with respect to the standard Gaussian measure ${\gamma}$ on ${{\mathbb R}^d}$ . Thus, the optimal measure ${Q^\mu}$ is simply the law of the standard ${d}$ -dimensional Brownian motion on ${[0,1]}$ , “conditioned” to have law ${d\mu = fd\gamma}$ at ${t=1}$ .

2. The gradient ansatz and the Girsanov transformation

In principle, (5) already tells us that we can obtain samples from ${\mu}$ by “reweighting” the Brownian paths ${W}$ using the weight factor ${f(W_1)}$ that depends only on ${W_1}$ . What we are after, however, is something different: Instead of reweighting, we would like to construct a “transport” map ${T : \Omega \rightarrow \Omega}$ such that ${Q^\mu = T_*P}$ , i.e., the optimal measure ${Q^\mu}$ is the pushforward of the Wiener measure by ${T}$ :

$\displaystyle {\mathbb E}_{Q^\mu}[F(X)] = {\mathbb E}_P[F\circ T(W)]$

for any bounded measurable ${F : \Omega \rightarrow {\mathbb R}}$ . Moreover, we want this ${T}$ to be such that the “transported” process ${X := T \circ W}$ is an Ito diffusion process. In order to construct such a map, we will use a fundamental result in stochastic calculus, the Girsanov theorem, which gives the necessary and sufficient condition for a probability measure ${Q}$ on ${\Omega}$ to be absolutely continuous with respect to the Wiener measure ${P}$ .

In order to make use of it, we will first make the gradient ansatz, i.e., we will assume that the diffusion process we seek is governed by an Ito SDE of the form

$\displaystyle dX_t = -\nabla v(X_t,t)dt + dW_t, \qquad X_0 = 0, t \in [0,1] \ \ \ \ \ (6)$

where ${\nabla}$ denotes the gradient with respect to the space variable. Then we will show that we can choose a suitable function ${v : {\mathbb R}^d \times [0,1] \rightarrow {\mathbb R}}$ which is twice continuously differentiable in ${x}$ and once continuously differentiable in ${t}$ , such that the probablity law of the resulting process ${X}$ will be given by ${Q^\mu}$ . To motivate the introduction of the gradient ansatz, it is useful to compare the SDE (6) with the Langevin SDE (1) — the latter has a time-invariant drift given by the gradient of the log-density of ${\mu}$ with respect to the Lebesgue measure, while in the former we are using a time-varying drift since our goal is to obtain a sample from ${\mu}$ at time ${t=1}$ , and intuition suggests that a nonstationary process is needed to make this work.

At any rate, denoting by ${Q}$ the probability law of the process (6), we can use the Girsanov theorem to write down the Radon–Nikodym derivative of ${Q}$ with respect to ${P}$ :

$\displaystyle \frac{dQ}{dP}(W) = \exp\left\{-\int^1_0 \langle \nabla v(W_t,t), dW_t \rangle - \frac{1}{2}\int^1_0 |\nabla v(W_t,t)|^2 dt\right\}. \ \ \ \ \ (7)$

Comparing (7) with (5), the path forward is now clear: We need to show that ${v}$ can be chosen in such a way that the right-hand side of (7) is equal to ${f(W_1)}$ . This is where the Ezawa–Klauder–Shepp trick comes in. The key idea is to eliminate the Ito integral in (7) and then to reduce the problem to solving a suitable partial differential equation (PDE).

Let us define the process ${(V_t)_{t \in [0,1]}}$ by ${V_t := v(W_t,t)}$ . Ito’s lemma then gives

$\displaystyle dV_t = \left(\frac{\partial}{\partial t}v(W_t,t) + \frac{1}{2}\Delta v(W_t,t)\right) dt + \langle \nabla v(W_t,t), dW_t \rangle,$

where ${\Delta = \nabla \cdot \nabla}$ is the Laplacian. Integrating and rearranging, we obtain

$\displaystyle -\int^1_0 \langle \nabla v(W_t,t), dW_t \rangle = V_0 - V_1 + \int^1_0 \left(\frac{\partial}{\partial t}v(W_t,t) + \frac{1}{2}\Delta v(W_t,t)\right) dt.$

Substituting this into (7) and using the definition of ${V_t}$ gives

$\displaystyle \frac{dQ}{dP}(W) = \exp\left\{v(0,0) - v(W_1,1) + \int^1_0 \left( \frac{\partial}{\partial t}v(W_t,t) + \frac{1}{2}\Delta v(W_t,t) -\frac{1}{2} |\nabla v(W_t,t)|^2\right) dt\right\}.$

If we are to have any hope of reducing the right-hand side to an expression that depends only on ${W_1}$ , we need to ensure that the integral over ${t}$ is identically zero.

3. The PDE connection

One way of guaranteeing this is to assume that ${v}$ solves the PDE

$\displaystyle \frac{\partial}{\partial t}v(x,t) + \frac{1}{2}\Delta v(x,t) = \frac{1}{2} |\nabla v(x,t)|^2 \ \ \ \ \ (8)$

for all ${(x,t) \in {\mathbb R}^d \times [0,1]}$ , subject to the terminal condition ${v(x,1) = - \log \psi(x)}$ for some strictly positive function ${\psi : {\mathbb R}^d \rightarrow (0,\infty)}$ to be chosen later. Assuming that such a ${v}$ exists, we will then have

$\displaystyle \frac{dQ}{dP}(W) = e^{v(0,0)}\psi(W_1) \ \ \ \ \ (9)$

which means that ${\psi}$ should be chosen in such a way that ${e^{v(0,0)}\psi(x) = f(x)}$ for all ${x \in {\mathbb R}^d}$ . Now, the PDE (8) is nonlinear in ${v}$ due to the presence of the squared norm of the gradient of ${v}$ on the right-hand side. However, it is known from the theory of PDEs that we can convert it into a linear PDE that can be solved explicitly; this is accomplished by making the logarithmic (or Cole–Hopf) transformation ${h(x,t) := e^{-v(x,t)}}$ . It is then a straightforward if tedious exercise in multivariate calculus to show that the function ${h}$ solves a much nicer linear PDE

$\displaystyle \frac{\partial}{\partial t}h(x,t) + \frac{1}{2}\Delta h(x,t) = 0 \ \ \ \ \ (10)$

on ${{\mathbb R}^d \times [0,1]}$ subject to the terminal condition ${h(x,1) = \psi(x)}$ . The solution of (10) is given by the Feynman–Kac formula, one of the remarkable connections between the theory of (parabolic) PDEs and diffusion processes: For any ${x \in {\mathbb R}^d}$ and ${t \in [0,1]}$ ,

$\displaystyle h(x,t) = {\mathbb E}_P[\psi(W_1)|W_t = x]. \ \ \ \ \ (11)$

The proof of (11) is a simple application of Ito’s lemma: for any ${t \in [0,1]}$ ,

$\displaystyle h(W_1,1) = h(W_t,t) + \int^1_t \left( \frac{\partial}{\partial s}h(W_s,s) + \frac{1}{2}\Delta h(W_s,s)\right)ds + \int^1_t \langle \nabla h(W_s,s), dW_s \rangle.$

Taking the conditional expectation given ${W_t = x}$ and using the fact that ${h}$ solves (10), we obtain (11).

Going back to (8), we can now write

$\displaystyle v(x,t) = - \log h(x,t) = - \log {\mathbb E}_P[\psi(W_1)|W_t = x],$

and, in particular, ${v(0,0) = - \log {\mathbb E}_P[\psi(W_1)|W_0 = 0] = -\log {\mathbb E}_P[\psi(W_1)] = - \log {\mathbb E}_\gamma[\psi]}$ . Substituting all of this into (9) gives

$\displaystyle \frac{dQ}{dP}(W) = \frac{h(W_1,1)}{h(0,0)} = \frac{\psi(W_1)}{{\mathbb E}_\gamma[\psi]},$

which means that we should evidently choose ${\psi = f}$ so that ${Q = Q^\mu}$ . Moreover, using the known Gaussian form of the transition probability densities of the Brownian motion, we can give the drift in (6) explicitly as

$\displaystyle \begin{array}{rcl} -\nabla v(x,t) &=& \nabla \log {\mathbb E}_\gamma[f(x + \sqrt{1-t}Z)] \\ &=& \nabla \log \int_{{\mathbb R}^d} f(x + \sqrt{1-t}z) e^{-|z|^2/2} dz \end{array}$

where ${Z}$ is distributed according to the canonical Gaussian measure ${\gamma}$ . This is known as the Föllmer drift (see this paper by Ronen Eldan and James Lee for more information and for further references).

4. The optimal control formulation

Those of you who are familiar with the theory of controlled diffusion processes should recognize the PDE (8) as the Hamilton–Jacobi–Bellman equation for the value function of a certain finite-horizon optimal stochastic control problem. While I do not want to get too much into details, I will sketch the basic idea. The first step is to use the identity

$\displaystyle -\frac{1}{2}|\xi|^2 = \min_{u \in {\mathbb R}^d} \left\{ \langle \xi,u \rangle + \frac{1}{2}|u|^2\right\},$

where the minimum is achieved uniquely by ${u^* = - \xi}$ , to rewrite (8) in the equivalent form

$\displaystyle \frac{\partial}{\partial t}v(x,t) + \frac{1}{2}\Delta v(x,t) = - \min_{u \in {\mathbb R}^d} \left\{ \langle \nabla v(x,t), u \rangle + \frac{1}{2}|u|^2\right\} \ \ \ \ \ (12)$

for ${(x,t) \in {\mathbb R}^d \times [0,1]}$ with ${v(x,1) = -\log f(x)}$ . The stochastic control connection is as follows: We seek an adapted control process ${u = (u_t)_{t \in [0,1]}}$ to minimize the expected cost

$\displaystyle J(u) := {\mathbb E}^u \left[ \frac{1}{2}\int^1_0 |u_t|^2 dt - \log f(X^u_1) \right], \ \ \ \ \ (13)$

where ${{\mathbb E}^u[\cdot]}$ denotes expectation with respect to the controlled diffusion process

$\displaystyle dX^u_t = u_t dt + dW_t, \qquad X^u_0 = 0, t \in [0,1]. \ \ \ \ \ (14)$

Here, the integral ${\frac{1}{2}\int^1_0 |u_t|^2 dt}$ is the control cost, while ${-\log f(X^u_1)}$ is the terminal cost at ${t=1}$ . It can then be shown using the so-called verification theorem in the theory of controlled diffusions that the solution ${v(x,t)}$ of the PDE (12) gives the value function (or the optimal cost-to-go function)

$\displaystyle v(x,t) := \min_{u} {\mathbb E}^u \Bigg[\frac{1}{2}\int^1_t |u_s|^2 ds - \log f(X^u_1) \Bigg|X^u_t = x\Bigg],$

so that, in particular, ${v(0,0) \equiv \min_u J(u)}$ . Here, the idea is that we add the drift ${u}$ to the standard Brownian motion ${W}$ to steer the process towards the target distribution ${\mu}$ at ${t=1}$ , while keeping the total “control effort” small. Of course, from the preceding derivations we know that ${v(0,0) = - \log {\mathbb E}_\gamma[f(Z)] = 0}$ , and that the optimal control is of the state feedback form, given explicitly by the Föllmer drift. Moreover, one can show using the Girsanov theorem that every ${Q \in {\cal M}^\mu}$ can be realized as a controlled process of the form (14) for some admissible drift ${u}$ , so that ${X^u_1}$ has law ${\mu}$ and

$\displaystyle D(Q \| P) = {\mathbb E}^u \left[ \frac{1}{2}\int^1_0 |u_t|^2 dt \right] = J(u) + {\mathbb E}^u[\log f(X^u_1)] = J(u) + D(\mu \| \gamma).$

This implies, in turn, that the Föllmer drift that gives the optimal measure ${Q^\mu}$ is the most efficient control, in the sense that the sum of the expected control cost and the expected terminal cost is identically zero, and therefore none of the control effort is wasted along the way. Of course, one could have started with this optimal control formulation (as I do in this paper with Belinda Tzen), but I feel that the above derivation is more “elementary” in the sense that it does not rely on any specialized knowledge beyond basic stochastic calculus.

http://infostructuralist.wordpress.com/?p=752

Extensions

Counting bits with Vapnik and Chervonenkis

mraginsky Apr 28, 2015

Machine learning is about enabling computers to improve their performance on a given task as they get more data. Can we express this intuition quantitatively using information-theoretic techniques? In this post, I will discuss a classic paper by David Haussler, Michael Kearns, and Robert Schapire that (to the best of my knowledge) took the first … Continue reading "Counting bits with Vapnik and Chervonenkis"

Show full content

Machine learning is about enabling computers to improve their performance on a given task as they get more data. Can we express this intuition quantitatively using information-theoretic techniques? In this post, I will discuss a classic paper by David Haussler, Michael Kearns, and Robert Schapire that (to the best of my knowledge) took the first step in this direction. In this post, I will describe some of their results, recast in a more explicitly information-theoretic way.

We will stick to the setting of binary classification: we have a feature space ${{\mathsf X}}$ (which is completely arbitrary) and a binary label space ${{\mathsf Y} = \{0,1\}}$ . Let ${{\mathcal F}}$ be a class of functions ${f : {\mathsf X} \rightarrow {\mathsf Y}}$ , which we will call the hypothesis space. These functions are our models for the relationship between features and labels, which we wish to learn from data. In order to focus on the essential aspects of the problem, we will adopt the Bayesian approach and assume that the “true” model is a random element of ${{\mathcal F}}$ with a given distribution ${Q}$ . We assume that the learning agent knows ${Q}$ , so we can think of this distribution as the learner’s prior. Training data consist of input-output pairs ${(x_t,Y_t)}$ , where ${t=0,1,2,\ldots}$ . Here, each ${x_t}$ is an arbitrary point of ${{\mathsf X}}$ , and for now we will treat ${{\bf x} = (x_1,x_2,\ldots)}$ as an individual sequence. The labels ${Y_t}$ , though, are random and given by ${Y_t = F(x_t)}$ , where ${F \sim Q}$ is the model drawn by Nature from ${Q}$ .

The learning agent receives the training data sequentially: at each time ${t=1,2,\ldots}$ , it has access to ${(x_1,Y_1),\ldots,(x_t,Y_t)}$ . Let us denote the pair ${(x_t,Y_t)}$ by ${Z_t}$ . The learning task is: given the next input ${x_{t+1}}$ , predict the value ${Y_{t+1}}$ . A learning algorithm ${{\mathcal A}}$ is a sequence of possibly randomized mappings ${a_t :(Z^{t-1}, x_{t}) \mapsto \widehat{Y}_t}$ : at each time ${t}$ , the learning agent uses all available training data to generate a prediction for the label ${Y_{t}}$ of ${x_{t}}$ . After generating the prediction, the learning agent observes the true ${Y_{t}}$ . We are interested in the probability that ${{\mathcal A}}$ makes an error at time ${t}$ :

$\displaystyle e_{{\bf x}}({\mathcal A},t) = Q\big[\widehat{Y}_t \neq Y_t\big] \equiv Q\big[a_t(Z^{t-1},x_t) \neq Y_t\big].$

We will refer to the function ${t \mapsto e_{{\bf x}}({\mathcal A},t)}$ as the learning curve of ${{\mathcal A}}$ [note that it depends on the sequence ${{\bf x}}$ of inputs supplied to the learning agent during learning].

How can we measure the learner’s progress? Let’s write out the causal ordering of all the random variables generated in the process of learning:

$\displaystyle F,\widehat{Y}_1,Z_1,\widehat{Y}_2,Z_2,\ldots,\widehat{Y}_t,Z_t,\ldots$

Here, ${F}$ has distribution ${Q}$ , ${\widehat{Y}_t}$ are determined by the learning algorithm ${{\mathcal A}}$ , and ${Z_t}$ are determined by ${x_t}$ and ${F}$ . Let us denote the joint distribution of all of these variables by ${{\mathbb P}}$ . When the learning agent observes ${Z_{t+1}=z_{t}}$ at time ${t}$ , we can quantify its information gain relative to time ${t-1}$ by

$\displaystyle {\rm IG}(t) = D({\mathbb P}_{F|Z^{t}=z^{t}} \| {\mathbb P}_F) - D({\mathbb P}_{Z^{t-1}=z^{t-1}} \| {\mathbb P}_F)$

It’s not hard to see that ${{\rm IG}(t)}$ is nonnegative because ${Z^{t}}$ (information at time ${t}$ ) includes ${Z^{t-1}}$ . The expected information gain is given by

$\displaystyle {\mathbb E} [{\rm IG}(t)] = I(F; Z_t | Z^{t-1}),$

the conditional mutual information between ${F}$ and ${Z_t}$ given ${Z^{t-1}}$ . For our purposes, it will be convenient to express the information gain in terms of a certain quantity called the volume ratio, which quantifies the rate at which the learning agent can zero in on the true model as more observations keep arriving.

To that end, let us first observe that, at time ${t}$ , the agent can localize the unknown model ${F}$ to the set

$\displaystyle {\mathcal V}_t = \left\{ {f \in {\mathcal F}:} f(x_s) = Y_s, 1 \le s \le t \right\}.$

This set is known in learning theory as the version space at time ${t}$ . It is easy to see that the version spaces are nested: ${{\mathcal V}_{t} \subseteq {\mathcal V}_{t-1} \subseteq \ldots \subseteq {\mathcal V}_1 \subseteq {\mathcal V}_0 \equiv {\mathcal F}}$ . Clearly, ${F \in {\mathcal V}_t}$ for all ${t}$ , and for any ${{\mathcal E} \subseteq {\mathcal F}}$

$\displaystyle {\mathbb P}[F \in {\mathcal E}|Z^t=z^t] = Q({\mathcal E} | {\mathcal V}_t) = \frac{Q({\mathcal E} \cap {\mathcal V}_t)}{Q({\mathcal V}_t)}$

This implies, in particular, that

$\displaystyle D({\mathbb P}_{F|Z^{t}=z^{t}} \| {\mathbb P}_F) = \log \frac{1}{Q({\mathcal V}_t)},$

and therefore that

$\displaystyle {\rm IG}(t) = \log \frac{Q({\mathcal V}_{t-1})}{Q({\mathcal V}_{t})},$

where ${Q({\mathcal V}_0) = Q({\mathcal F})=1}$ . Since ${F \in \bigcap_t {\mathcal V}_t}$ , we expect that each training sample shrinks the version space, and that the learning curve of any good learning algorithm should be controlled by the rate at which ${Q({\mathcal V}_t)}$ decreases. For future convenience, let us define the volume ratio at time ${t}$ :

$\displaystyle \xi_t = \frac{Q({\mathcal V}_{t})}{Q({\mathcal V}_{t-1})}.$

Note, by the way, that since ${{\mathcal V}_t \subseteq {\mathcal V}_{t-1}}$ , we have

$\displaystyle \xi_t = \frac{Q({\mathcal V}_t \cap {\mathcal V}_{t-1})}{Q({\mathcal V}_{t-1})} = {\mathbb P}[F \in {\mathcal V}_t | F \in {\mathcal V}_{t-1}].$

Consequently,

$\displaystyle {\rm IG}(t) = - \log \xi_t.$

We are going to use binary algorithms, so the units of all information quantities are bits.

Now let’s get down to business. Following Haussler et al., we will consider the following two learning algorithms:

The Bayes algorithm ${{\mathcal A}^{\rm Bayes}}$ : at time ${t}$ , it partitions the version space ${{\mathcal V}_{t-1}}$ into two disjoint subsets
$\displaystyle {\mathcal S}^y_t = \left\{ f \in {\mathcal V}_{t-1} : f(x_t) = y \right\}, \qquad y \in \{0,1\}$

and then predicts

$\displaystyle \widehat{Y}_t = {\bf 1}\left\{Q({\mathcal S}^1_t)-Q({\mathcal S}^0_t) \ge 0\right\}.$

The Bayes algorithm is optimal in the sense that it minimizes the probability of incorrectly predicting ${Y_t}$ given all available information. However, it is an improper learning algorithm in the sense that the function ${x_t \mapsto \widehat{Y}_t}$ will not, in general, be an element of ${{\mathcal F}}$ .
The Gibbs algorithm ${{\mathcal A}^{\rm Gibbs}}$ : at time ${t}$ , it draws a random ${\widehat{F}_t \sim Q_{F|F \in {\mathcal V}_{t-1}}}$ and predicts
$\displaystyle \widehat{Y}_t = \widehat{F}_t(x_t).$

The Gibbs algorithm is suboptimal, but, unlike the Bayes algorithm, the function ${x_t \mapsto \widehat{Y}_t}$ does belong to ${{\mathcal F}}$ by construction.

Now let’s analyze their learning curves:

The Bayes algorithm will make a mistake whenever ${Q({\mathcal V}_t) < Q({\mathcal V}_{t-1})/2}$ , i.e., whenever ${\xi_t < 1/2}$ . Indeed, using the fact that, by definition of the version space,
$\displaystyle {\mathcal S}^{Y_t}_t = \left\{ f \in {\mathcal V}_{t-1} : f(x_t) = Y_t \right\} \equiv {\mathcal V}_t,$

we have

$\displaystyle Q({\mathcal V}_t) < Q({\mathcal V}_{t-1})/2 \qquad \Longrightarrow \qquad Q({\mathcal S}^{Y_t}_t) < \frac{Q({\mathcal V}_{t-1})}{2} = \frac{Q({\mathcal S}^{Y_t}_t)+Q({\mathcal S}^{1-Y_t}_t)}{2}.$

Thus, if ${\xi_t < 1/2}$ , then ${Q({\mathcal S}^{Y_t}_t) < Q({\mathcal S}^{1-Y_t}_t)}$ , and so ${\widehat{Y}_t \neq Y_t}$ . On the other hand, if ${\xi_t > 1/2}$ , the Bayes algorithm will be correct: ${\widehat{Y_t} = Y_t}$ . Finally, if ${\xi_t = 1/2}$ , then the Bayes algorithm will make a mistake with probability ${1/2}$ . Therefore,

$\displaystyle e_{{\bf x}}({\mathcal A}^{\rm Bayes},t) = {\mathbb E} [\vartheta(1/2-\xi_t)],$

where we have defined the function

$\displaystyle \vartheta(u) = \begin{cases} 0, & u < 0 \\ 1/2, & u = 0 \\ 1, & u > 0 \end{cases}$
The Gibbs algorithm will make a mistake whenever ${\widehat{F}_t}$ is not in the version space ${{\mathcal V}_t}$ . Since ${\widehat{F}_t}$ is drawn from the posterior distribution of ${F}$ given ${F \in {\mathcal V}_{t-1}}$ , we have
$\displaystyle e_{{\bf x}}({\mathcal A}^{\rm Gibbs},t) = {\mathbb P}\left[\widehat{F}_t \not\in {\mathcal V}_t\right] = {\mathbb E} [1-\xi_t].$

Before stating and proving the main result, we need to collect some preliminary facts. First of all, the volume ratio ${\xi_t}$ is always bounded between ${0}$ and ${1}$ . Moreover, we have the following useful formula: for any function ${g : [0,1] \rightarrow [0,1]}$ ,

$\displaystyle {\mathbb E} [g(\xi_t)] = {\mathbb E} [\xi_t g(\xi_t) + (1-\xi_t)g(1-\xi_t)] \ \ \ \ \ (1)$

(the proof is by conditioning on ${Z^{t-1}}$ and using the law of iterated expectation).

Theorem 1 We have the following bounds for the Bayes and the Gibbs algorithms:

$\displaystyle e_{{\bf x}}({\mathcal A}^{\rm Gibbs},t) \ge e_{{\bf x}}({\mathcal A}^{\rm Bayes},t) \ge h^{-1}\left(I(F;Z_t|Z^{t-1})\right) \ \ \ \ \ (2)$

and

$\displaystyle \frac{1}{2}e_{{\bf x}}({\mathcal A}^{\rm Gibbs},t) \le e_{{\bf x}}({\mathcal A}^{\rm Bayes},t) \le e_{{\bf x}}({\mathcal A}^{\rm Gibbs},t) \le \frac{I(F; Z_t | Z^{t-1})}{2}, \ \ \ \ \ (3)$

where ${h^{-1} : [0,1] \rightarrow [0,1/2]}$ is the inverse of the binary entropy function ${h(u) = -u\log u - (1-u)\log(1-u)}$ .

Remark 1 Equation (2) shows that the error probability of the Bayes algorithm at time ${t}$ can be bounded below by a function of the expected information gain at time ${t}$ . It also shows that the error probability of the Gibbs algorithm is at least as large as that of the Bayes algorithm. On the other hand, Equation (3) gives an upper bound on the error performance of the Bayes algorithm in terms of the expected information gain, and it also shows that the error of the Gibbs algorithm is at most twice as large as that of the Bayes algorithm.

Proof: We start by obtaining the following alternative expressions for the error probabilities of the two algorithms, as well as for the expected information gain:

Using (1) with ${g(u) = - \log u}$ , we have

$\displaystyle I(F; Z_t | Z^{t-1}) = {\mathbb E} [-\log \xi_t] = -{\mathbb E} [\xi_t \log \xi_t + (1-\xi_t) \log (1-\xi_t)] = {\mathbb E} [h(\xi_t)].$

This shows, in particular, that the average information gain due to receiving another training sample is no more than one bit.

With ${g(u) = \vartheta(1/2-u)}$ , we have

$\displaystyle e_{{\bf x}}({\mathcal A}^{\rm Bayes},t) = {\mathbb E} [\vartheta(1/2-\xi_t)] = {\mathbb E} [\xi_t \vartheta(1/2-\xi_t) + (1-\xi_t) \vartheta(\xi_t - 1/2)] = {\mathbb E}\left[\min\{\xi_t,1-\xi_t\}\right],$

where we have used the fact that, for any ${u \in [0,1]}$ ,

$\displaystyle u\vartheta(1/2-u) + (1-u)\vartheta(u-1/2) = \min\{u,1-u\}.$

Another way to arrive at this expression is by using the following fact: Suppose we have a biased coin with probability of heads equal to ${p}$ . The best prediction of the outcome of a single toss of that coin, with respect to the Hamming loss ${\ell(y,\widehat{y}) = {\bf 1}\{y \neq \widehat{y}\}}$ , is to predict heads if ${p \ge 1-p}$ and tails otherwise. The expected loss in that case is equal to ${\min\{p,1-p\}}$ . The problem of predicting ${Y_t}$ given ${x_t}$ and ${Z^{t-1}=z^{t-1}}$ is equivalent to the above guessing problem with ${p = \xi_t}$ .

Finally, with ${g(u) = 1-u}$ , we have

$\displaystyle e_{{\bf x}}({\mathcal A}^{\rm Gibbs},t) = {\mathbb E} [1-\xi_t] = 2{\mathbb E} [\xi_t(1-\xi_t)].$

Again, if we consider the problem of guessing the outcome of tossing a ${{\rm Bernoulli}(p)}$ coin, we can think of the following suboptimal scheme: guess ${H}$ with probability ${p}$ and ${T}$ with probability ${1-p}$ (i.e., just mentally simulate the coin toss). The probability of error is exactly ${2p(1-p)}$ , which is also twice the variance of a ${{\rm Bernoulli}(p)}$ random variable. Reasoning conditionally on ${Z^{t-1}=z^{t-1}}$ , we see that the Gibbs algorithm is exactly this scheme with ${p = \xi_t}$ .

We will also use the following chain of easy-to-prove inequalities: for any ${u \in [0,1]}$ ,

$\displaystyle u(1-u) \le \min\{u,1-u\} \le 2u(1-u) \le \frac{h(u)}{2}. \ \ \ \ \ (4)$

We start by proving the lower bound, Equation (2). First, note that, for any ${u \in [0,1]}$ ,

$\displaystyle h^{-1}(h(u)) = \min\{u,1-u\}.$

Moreover, the function ${h^{-1} : [0,1] \rightarrow [0,1]}$ is convex. Therefore,

$\displaystyle e_{{\bf x}}({\mathcal A}^{\rm Bayes},t) = {\mathbb E}\left[\min\{\xi_t,1-\xi_t\}\right] = {\mathbb E} [h^{-1}(h(\xi_t))] \ge h^{-1}\left({\mathbb E} [h(\xi_t)]\right) =h^{-1}\left(I(F;Z_t|Z^{t-1})\right).$

The upper bound, Equation (3), follows by taking ${u=\xi_t}$ in (4) and then taking expectations. $\Box$

Now let’s see what happens when we have to do this over and over again: at each time ${t}$ , the learning agent gives a prediction ${\widehat{Y}_t}$ , observes ${Y_t}$ , updates ${Z_{t-1} \rightarrow Z_t}$ , etc. What can we say about the expected fraction of mistakes? Given a learning algorithm ${{\mathcal A}}$ , the quantity of interest is

$\displaystyle \bar{e}_{{\bf x}}({\mathcal A},T) = \frac{1}{T}\sum^T_{t=1} e_{{\bf x}}({\mathcal A},t),$

for each ${T=1,2,\ldots}$ . In order to get a handle on this quantity, we will need some Vapnik-Chervonenkis combinatorics. Given the sequence ${{\bf x}}$ of input instances, let ${{\mathcal D}_T}$ denote the set of all distinct elements of the tuple ${x^T = (x_1,\ldots,x_T)}$ . Say that two hypotheses ${f,f' \in {\mathcal F}}$ are equivalent if ${f(x)=f'(x)}$ for all ${x \in {\mathcal D}_T}$ . This defines an equivalence relation on ${{\mathcal F}}$ , so we can split it into equivalence classes ${{\mathcal E}_1,\ldots,{\mathcal E}_J}$ . Here, the number of equivalence classes ${J}$ depends on ${{\bf x}}$ and on ${T}$ . How many such classes can we have? Let us order the elements of ${{\mathcal D}_T}$ arbitrarily, say by ${\{x^{(1)},\ldots,x^{(N)}\}}$ , where ${N \le T}$ . Each ${f \in {\mathcal F}}$ can be associated with a binary string ${b(f,{\mathcal D}_T) = (f(x^{(1)}),\ldots,f(x^{(N)}))}$ . Then ${f}$ and ${f'}$ are in the same equivalence class if and only if they give rise to the same binary strings:

$\displaystyle b(f,{\mathcal D}_T) \equiv b(f',{\mathcal D}_T).$

Consequently,

$\displaystyle J = \left| \left\{ b(f,{\mathcal D}_T) : f \in {\mathcal F} \right\} \right|.$

For any subset ${{\mathcal D} \subseteq {\mathcal D}_T}$ , let us define ${b(f,{\mathcal D})}$ in the obvious way: if ${{\mathcal D} = \{x^{(j_1)},\ldots,x^{(j_k)})\}}$ , then we let ${b(f,{\mathcal D}) = (f(x^{(j_1)}),\ldots,f(x^{(j_k)}))}$ . Let ${d_T({\bf x})}$ be the largest integer ${d \le N}$ , such that

$\displaystyle \left| \left\{ b(f,{\mathcal D}) : f \in {\mathcal F} \right\}\right| = 2^{d}$

for all ${{\mathcal D} \subseteq {\mathcal D}_T}$ with ${|{\mathcal D}| = d}$ . A deep result, commonly known as the Sauer-Shelah lemma (see here for some fascinating history), then says that

$\displaystyle J \le \sum^{d_T({\bf x})}_{j=0} {N \choose j} \le \left(\frac{Ne}{d_T({\bf x})}\right)^{d_T({\bf x})} \le \left(\frac{Te}{d_T({\bf x})}\right)^{d_T({\bf x})}. \ \ \ \ \ (5)$

Now we are ready to state and prove the following:

Theorem 2

$\displaystyle \bar{e}_{{\bf x}}({\mathcal A}^{\rm Bayes},T) \le \bar{e}_{{\bf x}}({\mathcal A}^{\rm Gibbs},T) \le \frac{d_T({\bf x})}{2T} \log \left(\frac{Te}{d_T({\bf x})} \right). \ \ \ \ \ (6)$

Proof: We have

$\displaystyle \begin{array}{rcl} \bar{e}_{{\bf x}}({\mathcal A}^{\rm Bayes},T) &\le& \bar{e}_{{\bf x}}({\mathcal A}^{\rm Gibbs},T)\\ &\le& \frac{1}{2T}\sum^T_{t=1} {\mathbb E} [-\log \xi_t] \\ &=& \frac{1}{2T}\sum^T_{t=1} {\mathbb E}\left[\log \frac{Q({\mathcal V}_{t-1})}{Q({\mathcal V}_t)}\right] \\ &=& \frac{1}{2T}{\mathbb E}\left[\log \frac{Q({\mathcal V}_0)}{Q({\mathcal V}_T)}\right] \\ &=& \frac{1}{2T}{\mathbb E}\left[\log \frac{1}{Q({\mathcal V}_T)}\right]. \end{array}$

From the definition of the version spaces, it is immediate that ${{\mathcal V}_T}$ is the equivalence class containing the “true” model ${F \sim Q}$ . In other words, there exists some random ${j(F) \in \{1,\ldots,J\}}$ , such that ${{\mathcal V}_T = {\mathcal E}_{j(F)}}$ . Consequently, we can write

$\displaystyle \begin{array}{rcl} {\mathbb E}\left[\log \frac{1}{Q({\mathcal V}_T)}\right] &=& \sum^J_{j=1} {\mathbb E}\left[{\bf 1}\{F \in {\mathcal E}_j\}\log\frac{1}{Q({\mathcal V}_T)}\right] \\ &=& \sum^J_{j=1} {\mathbb E}\left[{\bf 1}\{F \in {\mathcal E}_j\}\log\frac{1}{Q({\mathcal E}_j)}\right] \\ &=& -\sum^J_{j=1} Q({\mathcal E}_j) \log Q({\mathcal E}_j). \end{array}$

We can recognize the quantity in the last line as the Shannon entropy of the index ${j(F)}$ : that is,

$\displaystyle \begin{array}{rcl} \bar{e}_{{\bf x}}({\mathcal A}^{\rm Bayes},T) &\le& \bar{e}_{{\bf x}}({\mathcal A}^{\rm Bayes},T) \\ &\le& \frac{H(j(F))}{2T}. \end{array}$

Now, since ${j(F)}$ takes values in ${\{1,\ldots,J\}}$ , we can invoke the Sauer-Shelah estimate (5) to get

$\displaystyle H(j(F)) \le \log J \le d_T({\bf x}) \log \left(\frac{T e}{d_T({\bf x})}\right).$

The claimed bound (6) follows. $\Box$

If ${{\mathcal F}}$ is a Vapnik-Chervonenkis class with VC dimension ${d}$ , then we immediately obtain:

Corollary 3

$\displaystyle \bar{e}_{{\bf x}}({\mathcal A}^{\rm Bayes},T) \le \bar{e}_{{\bf x}}({\mathcal A}^{\rm Gibbs},T) \le \frac{d}{2T} \log \left(\frac{Te}{d} \right).$

http://infostructuralist.wordpress.com/?p=728

Extensions

Information flow on graphs

mraginsky May 3, 2014

Show full content

Models of complex systems built from simple, locally interacting components arise in many fields, including statistical physics, biology, artificial intelligence, communication networks, etc. The quest to understand and to quantify the fundamental limits on the ability of such systems to store and process information has led to a variety of interesting and insightful results that draw upon probability, combinatorics, information theory, discrete and continuous dynamical systems, etc. In this post, I would like to focus on a model of distributed storage that was analyzed in 1975 by Donald Dawson in a very nice paper, which deserves to be more widely known.

The system Dawson analyzed is an instance of a Probabilistic Cellular Automaton (PCA), i.e., a system consisting of a discrete collection of nodes, each of which has a discrete state. The configuration of the automaton at each time is specified by listing the state of each node. The state of each node at time ${t+1}$ is determined stochastically by its own state, as well as by the states of its neighbors, at time ${t}$ , independently of everything else. One of the key questions in the theory of PCAs is whether a given automaton is ergodic, i.e., if its configuration eventually “forgets” the initial condition. Thus, Dawson’s result is a sufficient condition for ergodicity of a PCA.

Now let’s move on to precise definitions. Consider an undirected graph ${G = ({\cal V},{\cal E})}$ . The neighborhood of a vertex ${v \in {\cal V}}$ , denoted by ${\partial v}$ , is the set of all vertices ${u \in {\cal V}}$ that are connected to ${v}$ by an edge. We will also denote the set consisting of ${v}$ and all of its neighbors by ${\partial_+ v}$ . Let us fix a finite alphabet ${{\mathsf X}}$ ; for each vertex ${v \in {\cal V}}$ , let ${{\mathsf X}_v}$ denote a copy of ${{\mathsf X}}$ . A configuration is an assignment of a symbol ${x_v \in {\mathsf X}_v}$ to each vertex ${v \in {\cal V}}$ . Thus, ${\Omega = {\mathsf X}^{\cal V}}$ is the space of all configurations. We will denote a generic configuration by ${x = (x_v)_{v \in {\cal V}}}$ ; for any set ${{\cal J} \subset {\cal V}}$ , we will denote by ${x_{\cal J}}$ the natural projection of ${x}$ onto ${{\mathsf X}^{\cal J} = \prod_{v \in {\cal J}}{\mathsf X}_v}$ : ${x_{\cal J} = (x_v)_{v \in {\cal J}}}$ .

Now let’s consider the following model of a distributed one-bit memory cell. Let ${W}$ be a ${{\rm Bernoulli}(1/2)}$ random variable, which is the bit we wish to store. We also pick two probability measures ${\mu_0}$ and ${\mu_1}$ on ${\Omega}$ . Given ${W=w}$ , we draw the initial ${(t=1)}$ configuration ${X_1 = (X_{v,1})_{v \in {\cal V}}}$ at random according to ${\mu_w}$ . We allow arbitrary correlations between the different ${X_{v,1}}$ ‘s, so our storage is distributed in this sense. Now let’s suppose that the state of each vertex evolves stochastically in discrete time according to a local Markov model: for each ${v}$ , we have a channel (random transformation) ${K_v}$ with input alphabet ${{\mathsf X}^{\partial_+ v} = \prod_{u \in \partial_+ v} {\mathsf X}_u}$ and output alphabet ${{\mathsf X}_v}$ . Then the configuration ${X_t = (X_{v,t})_{v \in {\cal V}}}$ at time ${t}$ evolves according to the law

$\displaystyle {\mathbb P}[X_{t+1}=x_{t+1}|W=w,X^t = x^t] = \prod_{v \in {\cal V}}K_v(x_{v,t+1}|x_{\partial_+ v,t}), \qquad t=1,2,\ldots \ \ \ \ \ (1)$

with the initial conditions ${{\mathbb P}[W=w]=1/2}$ , ${w \in \{0,1\}}$ , and ${{\mathbb P}[X_1 = x_1|W=w]=\mu_w(x_1)}$ , ${x_1 \in \Omega}$ . Thus, we end up with a Markov chain

$\displaystyle W \longrightarrow X_1 \longrightarrow X_2 \longrightarrow X_3 \longrightarrow \ldots,$

where at each time ${t}$ the evolution from ${X_t}$ to ${X_{t+1}}$ is also Markov “in space:” the state of vertex ${v}$ at time ${t+1}$ is determined only by the states of the neighbors of ${v}$ at time ${t}$ , i.e., for each ${v}$ and ${t}$ , we have a Markov chain

$\displaystyle \left( W; X^{t-1}; X_{{\cal V} \backslash \partial_+ v,t} \right) \longrightarrow X_{\partial_+ v,t} \longrightarrow X_{v,t+1}.$

Now the question is: does the configuration at an arbitrary time ${t}$ retain any information about ${W}$ ? One way to answer this is to pick an arbitrary finite set ${{\cal J} \subset {\cal V}}$ and to examine the evolution of the mutual information ${I(W; X_{{\cal J},t})}$ , where

$\displaystyle X_{{\cal J},t} = (X_{v,t})_{v \in {\cal J}}.$

If we can show that there exists some ${\varepsilon > 0}$ such that ${I(W; X_{{\cal J},t}) \ge \varepsilon}$ for all ${t}$ and at least one nonempty ${{\cal J}}$ , then our memory cell is capable of storing one bit. If, on the other hand, ${I(W; X_{{\cal J},t}) \rightarrow 0}$ as ${t \rightarrow \infty}$ for every finite ${{\cal J}}$ , then eventually the memory cell will forget ${W}$ .

1. Dawson’s impossibility result

Roughly speaking, Dawson proved the following: if the local channels ${K_v}$ are sufficiently noisy and if the graph ${G}$ is sufficiently sparse, then the mutual information ${I(W; X_{{\cal J},t})}$ will decay to zero exponentially fast. In order to state his result precisely, we need some preliminary definitions. First of all, in order to quantify the noisiness of each channel ${K_v}$ , Dawson introduces the following quantity, which he calls the transmission coefficient of ${K_v}$ :

$\displaystyle \eta_v := \sup_{W \rightarrow Y \rightarrow Z} \frac{I(W; Z)}{I(W; Y)}, \ \ \ \ \ (2)$

where the supremum is over all Markov chains ${W \rightarrow Y \rightarrow Z}$ , where ${W \sim {\rm Bernoulli}(1/2)}$ , the channel from ${W}$ to ${Y \in {\mathsf X}^{\partial_+ v}}$ is arbitrary, and the channel from ${Y}$ to ${Z \in {\mathsf X}_v}$ is given by ${K_v}$ . Moreover, let ${\eta := \sup_{v \in {\cal V}} \eta_v}$ . Clearly, ${\eta_v \le 1}$ for every ${v \in {\cal V}}$ , by the data processing inequality. Consequently, ${\eta \le 1}$ .

Remark 1 Dawson’s definition of the transmission coefficient is reminiscent of the best constant in the data processing inequality of Erkip and Cover (corrected by Anantharam et al.): to define that constant in our setting, one would fix a distribution ${P_Y}$ and let

$\displaystyle \delta^*_v(P_Y) := \sup_{U \rightarrow Y \rightarrow Z} \frac{I(U; Z)}{I(U;Y)}, \ \ \ \ \ (3)$

where the supremum is over all Markov chains ${U \rightarrow Y \rightarrow Z}$ satisfying the constraints ${Y \sim P_Y}$ and ${P_{Z|Y} = K_v}$ . The difference is that in (2) we constrain the marginal distribution of ${W}$ and the stochastic kernel ${P_{Z|Y}}$ , whereas in (3) we constrain the marginal of ${Y}$ and the kernel ${P_{Z|Y}}$ . As shown by Anantharam et al.,

$\displaystyle \delta^*_v(P_Y) = \sup_{Q_Y \neq P_Y} \frac{D(Q_Z \| P_Z)}{D(Q_Y \| P_Y)},$

where the supremum is over all marginal distributions ${Q_Y}$ of ${Y}$ distinct from ${P_Y}$ , and ${Q_Z}$ denotes the marginal distribution of ${Z}$ when ${Y \sim Q_Y}$ . Both ${\eta_v}$ and ${\delta^*_v}$ measure the “noisiness” of ${K_v}$ , but in different ways. In particular, the motivation behind Dawson’s definition is as follows: At any time ${t}$ , the state ${X_{v,t+1}}$ of a given node at time ${t}$ depends only on ${X_{\partial_+v,t}}$ and not on anything else. Thus, if boundary condition, i.e., the configuration of all the nodes except for ${v}$ and its neighbors, is “frozen” at some value ${x_{{\cal V}\backslash\partial_+v}}$ , then the amount of information about ${W}$ that node ${v}$ will receive at time ${t+1}$ will be controlled by the quantity

$\displaystyle \frac{I(W; X_{v,t+1} | X_{{\cal V}\backslash \partial_+v,t} = x_{{\cal V}\backslash\partial_+v})}{I(W; X_{\partial_+v,t} | X_{{\cal V}\backslash \partial_+v,t} = x_{{\cal V}\backslash\partial_+v})},$

which, by definition, is no more than ${\eta_v}$ . We will elaborate on this point later on.

Also, Dawson assumes that ${G}$ has bounded degree, i.e.,

$\displaystyle \Delta := \sup_{v \in {\cal V}}|\partial_+ v| < \infty.$

Now we are in a position to state the following:

Theorem 1 (Dawson) For any finite set ${{\cal J} \subset {\cal V}}$ ,

$\displaystyle I(W; X_{{\cal J},t}) \le |{\cal J}| (\Delta\eta)^{t-1} \log 2. \ \ \ \ \ (4)$

Thus, if ${\Delta \eta < 1}$ , then ${I(W; X_{{\cal J},t}) \rightarrow 0}$ as ${t \rightarrow \infty}$ , and the convergence is exponentially fast.

2. The proof

The proof of Theorem 1 consists of several steps. The first step gives an upper bound on the information value of the state at a given node if we already observed the states of some other nodes. This bound is stated in Lemmas 2 and 3 below.

Lemma 2 Fix two arbitrary, possibly empty finite sets ${{\cal J},{\cal K} \subset {\cal V}}$ . Then, for any ${v \not\in {\cal J}}$ ,

$\displaystyle I(W; X_{v,t+1} | X_{{\cal J},t+1}, X_{{\cal K},t}) \le \eta I(W; X_{\partial_+ v,t} | X_{{\cal J},t+1}, X_{{\cal K},t}). \ \ \ \ \ (5)$

Proof: We begin by noting that, since ${v \not\in {\cal J}}$ ,

$\displaystyle (W,X_{{\cal K} \, \cup\, \partial_+ v,t},X_{{\cal J},t+1}) \rightarrow X_{\partial_+ v,t} \rightarrow X_{v,t+1} \ \ \ \ \ (6)$

is a Markov chain. Now, for any fixed choices of the configurations ${x_{{\cal K},t}}$ and ${x_{{\cal J},t+1}}$ , consider two probability distributions ${\nu_0}$ and ${\nu_1}$ of ${Y \in {\mathsf X}^{\partial_+ v}}$ :

$\displaystyle \nu_w(y) := {\mathbb P}\Big[ Y = y\Big| W=w, X_{{\cal K},t} = x_{{\cal K},t}, X_{{\cal J},t+1} = x_{{\cal J},t+1}\Big], \quad w \in \{0,1\}.$

From (6), if we consider a channel from ${W}$ to ${Y}$ with transition probabilities given by ${\nu_w}$ , and then let ${Z \in {\mathsf X}_v}$ be the output of ${K_v}$ with input ${Y}$ , then

$\displaystyle \begin{array}{rcl} I(W;Y) &=& I(W; X_{\partial_+ v,t} | X_{{\cal J},t+1} = x_{{\cal J},t+1}, X_{{\cal K},t}=x_{{\cal K},t}) \\ I(W;Z) &=& I(W; X_{v,t+1} | X_{{\cal J},t+1} = x_{{\cal J},t+1}, X_{{\cal K},t} = x_{{\cal K},t}), \end{array}$

and ${W \rightarrow Y \rightarrow Z}$ is a Markov chain satisfying the conditions in the definition (2) of the transmission coefficient ${\eta_v}$ . Therefore,

$\displaystyle \begin{array}{rcl} && I(W; X_{v,t+1} | X_{{\cal J},t+1} = x_{{\cal J},t+1}, X_{{\cal K},t} = x_{{\cal K},t}) \nonumber\\ &&\quad \le \eta_v I(W; X_{\partial_+ v,t} | X_{{\cal J},t+1} = x_{{\cal J},t+1}, X_{{\cal K},t}=x_{{\cal K},t}) \\ &&\quad \le \eta I(W; X_{\partial_+ v,t} | X_{{\cal J},t+1} = x_{{\cal J},t+1}, X_{{\cal K},t}=x_{{\cal K},t}) . \end{array}$

Taking the expectation with respect to ${X_{{\cal J},t+1}}$ and ${X_{{\cal K},t}}$ , we obtain (5). $\Box$

Lemma 3 For any finite set ${{\cal J}\subset{\cal V}}$ , any ${v \not\in {\cal J}}$ , and an arbitrary (possibly empty) finite set ${{\cal K}}$ ,

$\displaystyle \begin{array}{rcl} I(W; X_{{\cal J} \cup \{v\},t+1}|X_{{\cal K},t}) &\le& \eta I(W; X_{{\cal J},t+1} | X_{{\cal K}\, \cup\, \partial_+ v,t}) \nonumber\\ && \qquad \qquad + \eta I(W; X_{\partial_+ v,t}|X_{{\cal K},t}) \nonumber\\ && \qquad \qquad + (1-\eta) I(W; X_{{\cal J},t+1}|X_{{\cal K},t}). \end{array}$

Proof: Using the chain rule for mutual information and Lemma 2,

$\displaystyle \begin{array}{rcl} I(W; X_{{\cal J} \cup \{v\},t+1}|X_{{\cal K},t}) &=& I(W; X_{{\cal J},t+1} | X_{{\cal K},t}) + I(W; X_{v,t+1} | X_{{\cal J},t+1}, X_{{\cal K},t}) \nonumber \\ &\le& I(W; X_{{\cal J},t+1} | X_{{\cal K},t}) + \eta I(W; X_{\partial_+ v,t} | X_{{\cal J},t+1},X_{{\cal K},t}). \end{array}$

Now let’s examine the mutual information ${ I(W; X_{\partial_+ v,t} | X_{{\cal J},t+1},X_{{\cal K},t})}$ . Let us use the chain rule to write the conditional mutual information ${I(W; X_{\partial_+ v,t}, X_{{\cal J},t+1} | X_{{\cal K},t})}$ in two ways:

$\displaystyle \begin{array}{rcl} I(W; X_{\partial_+ v,t}, X_{{\cal J},t+1} | X_{{\cal K},t}) &=& I(W; X_{\partial_+ v,t} | X_{{\cal K},t}) + I(W; X_{{\cal J},t+1} | X_{{\cal K} \cup \partial_+ v,t}) \\ &=& I(W; X_{{\cal J},t+1} | X_{{\cal K},t}) + I(W; X_{\partial_+ v,t} | X_{{\cal J},t+1},X_{{\cal K},t}). \end{array}$

Therefore, we can write

$\displaystyle I(W; X_{\partial_+ v,t} | X_{{\cal J},t+1},X_{{\cal K},t}) = I(W; X_{\partial_+ v,t} | X_{{\cal K},t}) + I(W; X_{{\cal J},t+1} | X_{{\cal K} \cup \partial_+ v,t}) - I(W; X_{{\cal J},t+1} | X_{{\cal K},t}).$

Substituting this into our bound on ${I(W; X_{{\cal J} \cup \{v\},t+1}|X_{{\cal K},t})}$ , we obtain the statement of the lemma. $\Box$

The main message of Lemma 3 can be stated more succinctly if we introduce shorthand notation

$\displaystyle I^+_t({\cal J} |{\cal K}) := I(W; X_{{\cal J},t} | X_{{\cal K},t-1}) \ \ \ \ \ (7)$

\and

$\displaystyle I_t({\cal J} | {\cal K}) := I(W; X_{{\cal J},t} | X_{{\cal K},t}) \ \ \ \ \ (8)$

for all finite, possibly empty sets ${{\cal J},{\cal K}\subset{\cal V}}$ and for all ${t=1,2,\ldots}$ , where we also let

$\displaystyle \begin{array}{rcl} && I^+_{t}(\emptyset|{\cal K}) = I_t(\emptyset|{\cal K}) = 0;\\ && I^+_{t}({\cal J} | \emptyset) = I(W; X_{{\cal J},t}); \\ && I^+_1({\cal J} |{\cal K}) = I(W; X_{{\cal J},1}) \end{array}$

for all finite sets ${{\cal J},{\cal K} \subset {\cal V}}$ . A nice way to think about the above definitions is in terms of prediction and correction steps in Bayesian filtering: Eq. (7) corresponds to prediction since it quantifies the amount of information about ${W}$ that the nodes in ${{\cal J}}$ will contain at time ${t}$ given what we know about the state of the nodes in some other set ${{\cal K}}$ at time ${t-1}$ , while Eq. (8) quantifies the impact of knowing the state of the nodes in ${{\cal K}}$ at time ${t}$ , and so represents correction. Moreover, we can write the bound of Lemma 3 as

$\displaystyle I^+_{t+1}({\cal J} \cup \{v\} |{\cal K}) \le (1-\eta) I^+_{t+1}({\cal J} | {\cal K}) + \eta I^+_{t+1}({\cal J} | {\cal K} \cup \partial_+ v) + \eta I_t (\partial_+ v|{\cal K}), \ \ \ \ \ (9)$

for any ${{\cal J},{\cal K}}$ , any ${v \not\in{\cal J}}$ , and any ${t}$ .

The second step of the proof is to show that the exponential decay of information, according to Eq. (4), is a consequence of (9). To that end, Dawson employs a clever inductive argument. Let ${2^{{\cal V}}_0}$ denote the collection of all finite subsets of ${{\cal V}}$ , and for any ${t}$ define a set function ${I_t : 2^{{\cal V}}_0 \rightarrow {\mathbb R}}$ by

$\displaystyle I_t({\cal J}) := I(W; X_{{\cal J},t}) \equiv I\big(W; X_{v,t}, v \in {\cal J}\big).$

The quantity ${I_t(\cdot|\cdot)}$ defined in Eq. (8) can be computed in terms of ${I_t(\cdot)}$ : indeed, by the chain rule we have

$\displaystyle \begin{array}{rcl} I_t({\cal J}|{\cal K}) &=& I(W; X_{{\cal J},t} | X_{{\cal K},t}) \nonumber \\ &=& I(W; X_{{\cal J} \cup {\cal K},t}) - I(W; X_{{\cal K},t}) \nonumber \\ &=& I_t({\cal J} \cup {\cal K}) - I_t({\cal K}). \end{array}$

In addition, if we know ${I^+_{t+1}(\cdot|\cdot)}$ , we can also compute ${I_{t+1}({\cal J} | {\cal K})}$ :

$\displaystyle \begin{array}{rcl} I_{t+1}({\cal J} | {\cal K}) &=& I(W; X_{{\cal J},t+1} | X_{{\cal K},t+1}) \nonumber \\ &=& I(W; X_{{\cal J} \cup {\cal K},t+1}) - I(W; X_{{\cal K},t+1}) \nonumber \\ &=& I_{t+1}({\cal J} \cup {\cal K} | \emptyset) - I_{t+1}({\cal K} | \emptyset) \nonumber \\ &=& I^+_{t+1}({\cal J} \cup {\cal K} | \emptyset) - I^+_{t+1}({\cal K} | \emptyset). \end{array}$

Moreover, using our knowledge of ${I_t(\cdot)}$ and the fact that ${I^+_t(\emptyset|{\cal K}) = 0}$ for all ${{\cal K}}$ , we can do the following: Let us fix an arbitrary node ${u \in {\cal V}}$ . Then, from (9), we derive

$\displaystyle \begin{array}{rcl} I^+_{t+1}(\{u\} | {\cal K}) &\le& (1-\eta) \underbrace{I^+_{t+1}(\emptyset | {\cal K})}_{=0} + \eta \underbrace{I^+_{t+1}(\emptyset | {\cal K} \cup \partial_+ u)}_{=0} + \eta I_t(\partial_+ u | {\cal K}) \nonumber \\ &=& \eta I_t(\partial_+ u | {\cal K}). \end{array}$

Fixing another node ${u \neq v}$ , we have

$\displaystyle \begin{array}{rcl} I^+_{t+1}(\{u,v\} | {\cal K}) &\le& (1-\eta)I^+_{t+1}(\{u\}|{\cal K}) + \eta I^+_{t+1}(\{u\} | {\cal K} \cup \partial_+ v) + \eta I_t(\partial_+ v | {\cal K}) \\ &\le& (1-\eta) \eta I_t(\partial_+ u | {\cal K}) + \eta^2 I_t(\partial_+ u | {\cal K} \cup \partial_+ v) + \eta I_t(\partial_+ v | {\cal K}) \\ &=& (1-\eta)\eta \left[I_t(\partial_+ u | {\cal K}) + I_t(\partial_+ v | {\cal K})\right] + \eta^2 I_t(\partial_+ u | {\cal K} \cup \partial_+ v) \nonumber\\ && \qquad \qquad \qquad - (1-\eta)\eta I_t(\partial_+ v | {\cal K}) + \eta I_t(\partial_+ v | {\cal K}) \\ &=& (1-\eta)\eta \left[I_t(\partial_+ u | {\cal K}) + I_t(\partial_+ v | {\cal K})\right] + \eta^2 \left[I_t(\partial_+ v | {\cal K}) + I_t(\partial_+ u | {\cal K} \cup \partial_+ v) \right] \\ &=& (1-\eta)\eta \left[I_t(\partial_+ u | {\cal K}) + I_t(\partial_+ v | {\cal K})\right] + \eta^2 I_t(\partial_+ u \cup \partial_+ v | {\cal K}), \end{array}$

where the first step uses (9); the second uses our bound on ${I^+_{t+1}(\{u\}|{\cal K})}$ ; in the third, we add and subtract ${(1-\eta)\eta I_t(\partial_+ v | {\cal K})}$ ; and the last step uses the chain rule. Continuing inductively, we see that, for any ${{\cal J}, {\cal K} \in 2^{{\cal V}}_0}$ ,

$\displaystyle I^+_{t+1}({\cal J} | {\cal K}) \le \displaystyle\sum^{|{\cal J}|-1}_{k=0} \eta^{|{\cal J}|-k}(1-\eta)^k \displaystyle\sum_{{\cal J}' \subseteq {\cal J}:\, |{\cal J}\backslash{\cal J}'| = k} I_t\Bigg( \bigcup_{v \in {\cal J}'} \partial_+ v \Bigg| {\cal K}\Bigg).$

In particular, using the formula ${I_{t+1}({\cal J}|{\cal K}) = I^+_{t+1}({\cal J}\cup{\cal K}|\emptyset) - I^+_{t+1}({\cal K}|\emptyset)}$ with ${{\cal K} = \emptyset}$ , we obtain

$\displaystyle I(W; X_{{\cal J},t+1}) \le \displaystyle\sum^{|{\cal J}|-1}_{k=0} \eta^{|{\cal J}|-k}(1-\eta)^k \displaystyle\sum_{{\cal J}' \subseteq {\cal J}:\, |{\cal J} \backslash {\cal J}'| = k} I\Bigg(W; X_{w,t}, w \in \bigcup_{v \in {\cal J}'} \partial_+ v \Bigg). \ \ \ \ \ (10)$

The final step is to complete the proof using the fact that ${\Delta \eta < 1}$ . Since ${W}$ is binary, ${I(W; X_{{\cal J},1}) \le H(W) \le \log 2}$ . Therefore,

$\displaystyle I(W; X_{{\cal J},1}) \le |{\cal J}| \log 2, \qquad \forall {\cal J} \in 2^{{\cal V}}_0. \ \ \ \ \ (11)$

This can be interpreted as follows: if we define the specific information at time ${t}$ by

$\displaystyle U_t := \sup_{{\cal J}} \frac{I(W; X_{{\cal J},t})}{|{\cal J}|},$

then ${U_1 \le \log 2}$ . Now, we write

$\displaystyle \begin{array}{rcl} I(W; X_{{\cal J},2}) &\le& \displaystyle\sum^{|{\cal J}|-1}_{k=0} \eta^{|{\cal J}|-k}(1-\eta)^k \displaystyle\sum_{{\cal J}' \subseteq {\cal J}:\, | {\cal J} \backslash {\cal J}' | = k}I\Bigg(W; X_{w,1}, w \in \bigcup_{v \in {\cal J}'} \partial_+ v \Bigg) \nonumber \\ &\le &\displaystyle\sum^{|{\cal J}|-1}_{k=0} \eta^{|{\cal J}|-k}(1-\eta)^k \displaystyle\sum_{{\cal J}' \subseteq {\cal J}:\, | {\cal J} \backslash {\cal J}' | = k} \left| \bigcup_{v \in {\cal J}'} \partial_+ v \right| \log 2 \nonumber \\ &\le &\displaystyle\sum^{|{\cal J}|-1}_{k=0} \eta^{|{\cal J}|-k}(1-\eta)^k \displaystyle\sum_{{\cal J}' \subseteq {\cal J}:\, | {\cal J} \backslash {\cal J}' | = k} \displaystyle\sum_{v \in {\cal J}'} |\partial_+ v| \log 2 \nonumber \\ &\le& \displaystyle\sum^{|{\cal J}|-1}_{k=0} \eta^{|{\cal J}|-k}(1-\eta)^k {|{\cal J}| \choose |{\cal J}|-k} (|{\cal J}|-k) \Delta \log 2 \nonumber \\ &=& |{\cal J}| \Delta \eta \log 2 \displaystyle\sum^{|{\cal J}|-1}_{k=0} {|{\cal J}|-1 \choose |{\cal J}|-1-k} \eta^{|{\cal J}|-1-k}(1-\eta)^k \nonumber \\ &=& |{\cal J}| \Delta \eta \log 2, \end{array}$

where the first step follows from (10); the second uses (11); the third step is by (over)counting; the fourth step uses the fact that ${|\partial_+ v| \le \Delta}$ for every ${v}$ ; and the remaining steps are simple combinatorics. What we have shown is that the specific information at ${t=2}$ can be bounded as ${U_2 \le \Delta \eta \log 2}$ . Thus, if ${\Delta \eta < 1}$ , then in one step the specific information will strictly decrease. Iterating, we get the bound

$\displaystyle I(W; X_{{\cal J},t}) \le |{\cal J}| (\Delta \eta)^{t-1} \log 2.$

If ${\Delta \eta < 1}$ , then ${U_t \le (\Delta\eta)^{t-1} \rightarrow 0}$ as ${t \rightarrow \infty}$ . This completes the proof of Theorem 1.

http://infostructuralist.wordpress.com/?p=721

Extensions

Two public service announcements

mraginsky Feb 25, 2014

1. The 2014 IEEE North American Summer School on Information Theory will take place June 18-21, 2014 at the Fields Institute in Toronto, Canada. 2. For those of you who use Matlab or Octave, there is a new Information Theoretical Estimators (ITE) toolbox, an open-source toolbox “capable of estimating many different variants of entropy, mutual … Continue reading "Two public service announcements"

Show full content

1. The 2014 IEEE North American Summer School on Information Theory will take place June 18-21, 2014 at the Fields Institute in Toronto, Canada.

2. For those of you who use Matlab or Octave, there is a new Information Theoretical Estimators (ITE) toolbox, an open-source toolbox “capable of estimating many different variants of entropy, mutual information, divergence, association measures, cross quantities, and kernels on distributions.” Some more details are available in the guest post by the toolbox’s creator, Zoltán Zsabó, at the Princeton Information Theory b-log.

http://infostructuralist.wordpress.com/?p=719

Extensions

A graph-theoretic derivation of the Gilbert-Varshamov bound

mraginsky Sep 23, 2013

Just a quick note for my reference, but it may be of interest to others. Let denote the size of the largest code over a -ary alphabet that has blocklength and minimum distance . The well-known Gilbert-Varshamov bound says that where is the volume of a Hamming ball of radius in . The usual way … Continue reading "A graph-theoretic derivation of the Gilbert-Varshamov bound"

Show full content

Just a quick note for my reference, but it may be of interest to others.

Let ${A_q(n,d)}$ denote the size of the largest code over a ${q}$ -ary alphabet that has blocklength ${n}$ and minimum distance ${d}$ . The well-known Gilbert-Varshamov bound says that

$\displaystyle A_q(n,d+1) \ge \frac{q^n}{V_q(n,d)},$

where

$\displaystyle V_q(n,d) = \sum^d_{i=0}{n \choose i}(q-1)^i$

is the volume of a Hamming ball of radius ${d}$ in ${\{0,\ldots,q-1\}^n}$ . The usual way of arriving at the GV bound is through a greedy construction: pick an arbitrary codeword ${x}$ , then keep adding codewords that are at Hamming distance of at least ${d+1}$ from all codewords that have already been picked. When this procedure terminates, the complement of the union of the Hamming balls of radius ${d}$ around each of the codewords should be empty — otherwise, you will have at least one more codeword at distance of at least ${d+1}$ from the ones already picked, and this would mean that the procedure could not have terminated.

As it turns out, there is another way of deriving the GV bound using graph theory that I have learned from a nice paper by Van Vu and Lei Wu. They use this graph-theoretic interpretation to arrive at an asymptotic improvement of the GV bound. Their result, which I will not go into here, extends an earlier result by Tao Jiang and Alex Vardy for binary codes. As far as I can tell, the graph-theoretic ideas go back to the Jiang-Vardy paper as well.

In order to proceed, we need some definitions and a lemma. Let ${G = (V,E)}$ be an undirected graph. A set ${S \subseteq V}$ of vertices is called independent if no two vertices in ${S}$ are connected by an edge. The independence number of ${G}$ , denoted by ${I(G)}$ , is the cardinality of the largest independent set. The following lemma is folklore in graph theory:

Lemma 1 Suppose that ${G}$ is ${D}$ -regular, i.e., every vertex has exactly ${D}$ neighbors. Then

$\displaystyle I(G) \ge \frac{|V|}{D+1}. \ \ \ \ \ (1)$

Proof: Let ${I}$ be a maximal independent set. Any vertex ${v \in V\backslash I}$ is connected by an edge to at least one ${v' \in I}$ , because otherwise ${v}$ would have to be included in ${I}$ , which would contradict maximality. Therefore, there are at least ${|V \backslash I| = |V| - |I|}$ edges with one vertex in ${I}$ and another vertex in ${V \backslash I}$ . On the other hand, because ${G}$ is ${D}$ -regular, there can be at most ${D|I|}$ such edges. This means that

$\displaystyle D|I| \ge |V| - |I|.$

Rearranging, we get (1). $\Box$

Now let us construct the following graph (what Jiang and Vardy call the Gilbert graph): associate a vertex to each word in ${\{0,\ldots,q-1\}^n}$ , and connect two vertices by an edge if and only if the Hamming distance between the corresponding words is at most ${d}$ . This graph has ${q^n}$ vertices, and each vertex has degree ${D = V_q(n,d)-1}$ . Moreover, there is a one-to-one correspondence between independent sets of vertices and ${q}$ -ary codes of length ${n}$ and minimum distance at least ${d+1}$ , and the independence number of the Gilbert graph is equal to ${A_q(n,d+1)}$ . The bound (1) is then precisely the GV bound.

Briefly, the Vu-Wu improvement of the GV bound exploits the deep fact that, when the neighborhood of any vertex in a ${D}$ -regular graph is very sparse (in the sense that it has a lot fewer than ${{D \choose 2}}$ edges), the lower bound (1) can be significantly tightened. Apparently, actually counting the number of edges in such a neighborhood of any vertex of the Gilbert graph (by regularity, we may as well look at the neighborhood of the all-zero word) is rather complicated; Vu and Wu instead look at a suitable asymptotic regime when ${n}$ is large and ${d = \alpha n}$ for some ${\alpha \le 1-1/q}$ and replace exact combinatorial bounds by entropy bounds.

http://infostructuralist.wordpress.com/?p=716

Extensions

Lossless source coding at Western Union

mraginsky Sep 19, 2013

… circa 1935. Here is a quote from Single-Story America (translated as Little Golden America), an American travelogue by Ilya Ilf and Yevgeny Petrov: There is a whole book of readymade telegrams, long and convincing, lavishly composed telegrams for all occasions. Sending such a telegram costs only twenty-five cents. You see, what gets transmitted over … Continue reading "Lossless source coding at Western Union"

Show full content

… circa 1935. Here is a quote from Single-Story America (translated as Little Golden America), an American travelogue by Ilya Ilf and Yevgeny Petrov:

There is a whole book of readymade telegrams, long and convincing, lavishly composed telegrams for all occasions. Sending such a telegram costs only twenty-five cents. You see, what gets transmitted over the telegraph is not the text of the telegram, but simply the number under which it is listed in the book, and the signature of the sender. This is quite a funny thing, reminiscent of Drugstore Breakfast #2. Everything is served up in a ready form, and the customer is totally freed from the unpleasant necessity to think, and to spend money on top of it.

Typical set encoding, anyone?

http://infostructuralist.wordpress.com/?p=713

Extensions

ISIT 2013: two plenaries on concentration of measure

mraginsky Jul 30, 2013

Of the five plenary talks at this year’s ISIT, two were about concentration of measure: Katalin Marton’s Shannon lecture on “Distance-divergence inequalities” and Gabor Lugosi’s talk on “Concentration inequalities and the entropy method” the next morning. Since the topic of measure concentration is dear to my heart, I thought I would write down a few … Continue reading "ISIT 2013: two plenaries on concentration of measure"

Show full content

When we speak of concentration of measure, we mean the following. Let ${X}$ be a random variable taking values in some metric space ${{\mathsf X}}$ . Then we say that ${X}$ (or, more correctly, the distribution of ${X}$ ) has the concentration property if, for any set ${A \subset {\mathsf X}}$ such that ${{\mathbb P}(X \in A) \ge 1/2}$ , we have

$\displaystyle {\mathbb P}\left( d(X,A) \le r\right) \xrightarrow{r \rightarrow \infty} 1. \ \ \ \ \ (1)$

Here, ${d(x,A)}$ is the distance from the point ${x \in {\mathsf X}}$ to the set ${A}$ :

$\displaystyle d(x,A) := \inf_{y \in A} d(x,y).$

Another way to express (1) is as follows: for any set ${A \subset {\mathsf X}}$ and any ${r \ge 0}$ , define the ${r}$ -blowup of ${A}$ by

$\displaystyle A_r := \left\{ y \in {\mathsf X}: d(y,A) \le r \right\} \equiv \left\{ y \in {\mathsf X}: \exists x \in A \text{ such that } d(x,y) \le r \right\}.$

Then ${X}$ has the concentration property if

$\displaystyle {\mathbb P}(X \in A) \ge 1/2 \quad \Longrightarrow \quad \lim_{r \rightarrow \infty} {\mathbb P}(X \in A_r) = 1.$

In other words, ${X}$ has the concentration property if any set containing ${X}$ with not too small a probability can be “blown up” to contain ${X}$ with near-certainty.

Here are two classic examples of concentration:

Gaussian distribution in Euclidean space. Let ${{\mathsf X} = {\mathbb R}^n}$ and take ${d(x,y) = \| x - y \|_2}$ — the usual Euclidean distance. Let ${X}$ be a standard ${n}$ -dimensional Gaussian random variable, i.e., ${X \sim N(0,I_n)}$ , where ${I_n}$ is the ${n\times n}$ identity matrix. Then for any ${r \ge 0}$ we have
$\displaystyle {\mathbb P}(X \in A) \ge 1/2 \quad \Longrightarrow \quad {\mathbb P}(X \in A_r) \ge 1 - e^{-r^2/2}.$
Uniform distribution in Hamming space. Let ${{\mathsf X}}$ be the Hamming cube ${\{0,1\}^n}$ equipped with the normalized Hamming distance
$\displaystyle d(x,y) = \frac{1}{n}\sum^n_{i=1}1_{\{x_i \neq y_i\}}$

that counts the fraction of bits in which ${x = (x_1,\ldots,x_n)}$ and ${y = (y_1,\ldots,y_n)}$ disagree. Let ${X}$ have the uniform distribution on ${\{0,1\}^n}$ , i.e., ${{\mathbb P}(X=x) = 2^{-n}}$ for all ${x}$ . Then

$\displaystyle {\mathbb P}(X \in A) \ge 1/2 \quad \Longrightarrow \quad {\mathbb P}(X \in A_r) \ge 1 - e^{-2nr^2}.$

These two examples suggest that we should aim for “hard” statements in the form of sharp bounds on the concentration function

$\displaystyle \alpha_X(r) := \sup_{A:\, {\mathbb P}(X \in A) \ge 1/2} {\mathbb P}(X \not\in A_r)$

as opposed to merely “soft” statements of the form ${\alpha_X(r) \rightarrow 0}$ as ${r \rightarrow \infty}$ . The $64,000 question is: how do we get such bounds?

There are two ways to accomplish this goal, and the main idea underlying these two ways is to replace sets with some other objects that are hopefully easier to handle. The first way is to replace sets by probability measures, the second is to replace them by functions. Here is what I mean:

Fix a set ${A \subset {\mathsf X}}$ with ${\Pr(X \in A) > 0}$ . Let ${P}$ denote the distribution of ${X}$ , and let ${P_A}$ denote the conditional distribution of ${X}$ given ${X \in A}$ . That is, for any (measurable) set ${B \subset {\mathsf X}}$ we have

$\displaystyle P_A(B) := \frac{P(A \cap B)}{P(A)}.$

I am using the subscript notation ${P_A}$ instead of the more usual ${P(\cdot|A)}$ to indicate the fact that ${P_A}$ is a probability measure in its own right. In this way, we can associate to each non-null set ${A}$ a probability measure ${P_A}$ . Now, here is a very simple observation that turns out to be very consequential:

$\displaystyle D(P_A \| P) = \log \frac{1}{P(A)}. \ \ \ \ \ (2)$

This is very easy to prove: for any set ${B}$ we have

$\displaystyle P_A(B) = \frac{1}{P(A)}\int_B 1_A(x) P(dx),$

so ${P_A}$ is absolutely continuous with respect to ${P}$ with the Radon-Nikodym derivative ${dP_A/dP = 1_A/P(A)}$ . Therefore, by definition of the divergence,

$\displaystyle D(P_A \| P) = \int dP_A \log \frac{dP_A}{dP} = \frac{1}{P(A)}\int_A dP \log\frac{1}{P(A)} = \log \frac{1}{P(A)}.$

So if we are interested in bounding the probabilities of various sets ${A}$ , we may hope to get some mileage out of the relationship (2).

On the other hand, we may also associate to a set ${A}$ with ${P(A) > 0}$ the function

$\displaystyle f_A(x) := d(x,A) \equiv \inf_{y \in A} d(x,y).$

This function is Lipschitz: for any ${x,x' \in {\mathsf X}}$ ,

$\displaystyle f_A(x) - f_A(x') = \inf_{y \in A}d(x,y) - \inf_{y \in A} d(x',y) \le \sup_{y \in A} \left[d(x,y) - d(x',y)\right] \le d(x,x'),$

where the last step is by the triangle inequality. Interchanging the roles of ${x}$ and ${x'}$ , we get the Lipschitz property. Moreover, let us consider the random variable ${Z = f_A(X)}$ , where ${X}$ is our original ${{\mathsf X}}$ -valued random variable. Then we immediately notice two things:

For any ${r \ge 0}$ , ${{\mathbb P}(Z \le r) = {\mathbb P}\left(d(X,A) \le r\right) = {\mathbb P}(A_r)}$ .
If ${P(A) = {\mathbb P}(X \in A) \ge 1/2}$ , then ${0}$ is a median of ${Z}$ , in the sense that
$\displaystyle {\mathbb P}(Z \le 0) = P(A) \ge 1/2 \qquad \text{and} \qquad {\mathbb P}(Z > 0) \ge 1/2$

(the second inequality is obviously true since ${Z}$ is nonnegative with probability ${1}$ ).

These two observations suggest that we may obtain concentration bounds by bounding the probability that a given Lipschitz function of ${X}$ deviates from its median by more than ${r}$ . In fact, it is easy to derive an alternative expression for the concentration function ${\alpha_X}$ :

$\displaystyle \alpha_X(r) = \sup_{{\rm 1-Lipschitz\,} f} {\mathbb P}\Big( f(X) > m_f + r\Big), \ \ \ \ \ (3)$

where ${m_f}$ denotes any median of ${f(X)}$ . We already showed, by passing from ${A}$ to ${f_A = d(\cdot,A)}$ , that ${\alpha_X}$ is bounded from above by the quantity on the right-hand side of (3):

$\displaystyle \alpha_X(r) = \sup_{A :\, P(A) \ge 1/2} {\mathbb P} \Big( f_A(X) > \underbrace{m_{f_A}}_{=0} + r \Big ) \le \sup_{{\rm 1-Lipschitz\,}f} {\mathbb P}\Big( f(X) > m_f + r\Big)$

To prove the reverse inequality, fix any ${1}$ -Lipschitz function ${f}$ and consider the set ${A_f : = \left\{ x \in {\mathsf X} : f(x) \le m_f \right\}}$ , where ${m_f}$ is any median of ${f}$ . Then, by definition,

$\displaystyle {\mathbb P}(X \in A_f) = {\mathbb P}\Big( f(X) \le m_f \Big) \ge 1/2.$

Moreover, if we consider the ${r}$ -blowup

$\displaystyle [A_f]_r = \Big\{ x \in {\mathsf X}: d(x,A_f) \le r \Big\},$

then for any ${x \in {\mathsf X}}$ and any ${y \in A_f}$ we must have

$\displaystyle f(x) - m_f \le f(x) - f(y) \le d(x,y),$

where the last step is by the Lipschitz property of ${f}$ . Consequently, by definition of the concentration function,

$\displaystyle {\mathbb P}\Big( f(X) > m_f + r \Big) \le {\mathbb P}\Big( d(X,A_f) > r \Big) = 1 - P\left([A_f]_r\right) \le \alpha_X(r).$

By passing to the functional viewpoint, we obtain another equivalent characterization of the concentration property: a random variable ${X}$ taking values in a metric space ${({\mathsf X},d)}$ has the concentration property if real-valued Lipschitz functions ${X}$ are “nearly constant.”

Let’s look at the first, probabilistic viewpoint, which was born out of a 1996 breakthrough paper by Marton. Given a metric space ${({\mathsf X},d)}$ , let us define the ${L_1}$ Wasserstein distance (or transportation distance) between any two probability measures ${P}$ and ${Q}$ on it:

$\displaystyle W_1(P,Q) := \inf_{X \sim P, Y \sim Q} {\mathbb E}[d(X,Y)],$

where the infimum is over all jointly distributed random variables ${X,Y \in {\mathsf X}}$ , such that ${P_X = P}$ and ${P_Y = Q}$ . Now consider a random variable ${X \in {\mathsf X}}$ , for which we wish to establish concentration. What Marton showed is the following: Suppose the distribution ${P}$ of ${X}$ satisfies the ${L_1}$ transportation inequality

$\displaystyle W_1(P,Q) \le \sqrt{2c\, D(Q \| P)} \ \ \ \ \ (4)$

for some constant ${c > 0}$ . Then ${X}$ has the concentration property, and moreover

$\displaystyle P(A) \ge 1/2 \quad \Longrightarrow \quad P(A_r) \ge 1 - \exp\left(- \frac{1}{2c}\left( r - \sqrt{2c \log 2}\right)^2 \right), \forall r > \sqrt{2 c \log 2}.$

Marton’s proof is breathtakingly beautiful. Consider any two sets ${A,B}$ with ${P(A),P(B) \neq 0}$ . Recalling our notation for conditional distributions, we can write

$\displaystyle \begin{array}{rcl} W_1(P_A,P_B) &\le& W_1(P_A,P) + W_1(P_B,P) \\ &\le& \sqrt{2c\, D(P_A \| P)} + \sqrt{2c\, D(P_B \| P)} \\ &=& \sqrt{2c \log \frac{1}{P(A)}} + \sqrt{2c \log \frac{1}{P(B)}}, \end{array}$

where in the first step we have used the triangle inequality, in the second we have used the fact that ${P}$ satisfies the transportation inequality (4), and in the last step we have used the formula (2). Now suppose that ${P(A) \ge 1/2}$ and let ${B = A^c_r}$ for some ${r}$ , where ${c}$ denotes set-theoretic complement. Then we can show that ${W_1(P_A,P_B) \ge d(A,B) \ge r}$ . On the other hand,

$\displaystyle \log \frac{1}{P(A)} \le \log 2 \qquad \text{and} \qquad \log\frac{1}{P(B)} = \log\frac{1}{1-P(A_r)}.$

Combining these facts gives us the bound

$\displaystyle r \le \sqrt{2c \log 2} + \sqrt{2c \log \frac{1}{1-P(A_r)}}$

that holds for all ${r}$ . If ${r > \sqrt{2 c \log 2}}$ , then we get

$\displaystyle P(A_r) \ge 1 - \exp\left( - \frac{1}{2c}\left(r - \sqrt{2c \log 2}\right)^2\right),$

so we indeed have concentration and a sharp bound on ${\alpha_X(r)}$ , at least for large enough values of ${r}$ . The main message here is that, in order to study concentration, it suffices to work on the level of probability measures and to focus one’s effort on showing that the distribution of ${X}$ satisfies a suitable transportation inequality. Since Marton’s original work, there have been many refinements and extensions, which I will not go into here. One such result, due to Sergey Bobkov and Friedrich Götze, says that ${P}$ satisfying a transportation inequality (4) is equivalent to the Gaussian concentration property

$\displaystyle \alpha_X(r) \le e^{-r^2/2c}, \qquad \forall r \ge 0.$

Now let’s look at the complementary functional viewpoint. Recall that we seek tight upper bounds on deviation probabilities of the form

$\displaystyle {\mathbb P}\Big( f(X) \ge m_f + r \Big), \qquad \forall r > 0.$

It is easier to work with means instead of medians, and indeed it can be shown that concentration around the mean is equivalent to concentration around any median. So let’s focus on the mean. Let ${X}$ , as before, be a random variable over some metric space ${({\mathsf X},d)}$ , and consider a Lipschitz function ${f : {\mathsf X} \rightarrow {\mathbb R}}$ such that ${{\mathbb E}[f(X)] = 0}$ . We can apply the well-known Chernoff trick: for any ${r,\lambda > 0}$ we have

$\displaystyle {\mathbb P}\Big( f(X) \ge r \Big) = {\mathbb P}\Big( e^{\lambda f(X)} \ge e^{\lambda r} \Big) \le e^{-\lambda r} {\mathbb E}[e^{\lambda f(X)}].$

Now the whole affair hinges on the availability of tight upper bounds on the logarithmic moment-generating function ${\Lambda(\lambda) := \log {\mathbb E}[e^{\lambda f(X)}]}$ . The entropy method, which was the subject of Gabor Lugosi’s plenary, is the name for a set of techniques for bounding ${\Lambda(\lambda)}$ by means of various inequalities involving the relative entropy between various tilted distributions derived from ${P}$ and ${P}$ itself. The entropy method has roots in the work of Michel Ledoux, who in turn distilled it from some very deep results of Michel Talagrand.

The simplest version of the entropy method goes something like this. Let us define, for any ${t \in {\mathbb R}}$ , the tilted distribution ${P^{(t)}}$ via

$\displaystyle \frac{dP^{(t)}}{dP}(x) = \frac{e^{tf(x)}}{e^{\Lambda(t)}}$

(assuming, of course, that ${\Lambda(t)}$ exists and is finite). Then we have

$\displaystyle \begin{array}{rcl} D(P^{(t)} \| P) &=& \int dP^{(t)} \left[ tf - \Lambda(t)\right] \\ &=& \frac{1}{e^{\Lambda(t)}} t\, {\mathbb E}\left[f(X)e^{tf(X)}\right] - \Lambda(t) \\ &=& t \Lambda'(t) - \Lambda (t) \\ &=& t^2 \frac{d}{dt} \left(\frac{\Lambda(t)}{t}\right). \end{array}$

Integrating and using the fact that ${\Lambda(0) = 0}$ , we get

$\displaystyle \Lambda(\lambda) = \lambda \int^\lambda_0 \frac{D(P^{(t)} \| P)}{t^2} dt. \ \ \ \ \ (5)$

Now suppose that we can bound ${D(P^{(t)} \| P) \le ct^2/2}$ for some ${c > 0}$ . Then from (5) we have

$\displaystyle \Lambda(\lambda) \le \frac{c\lambda^2}{2}, \qquad \forall t$

which in turn gives the Gaussian tail bound

$\displaystyle {\mathbb P}\Big( f(X) \ge r \Big) \le \inf_{\lambda > 0} \exp\left(-\lambda r + \frac{c\lambda^2}{2}\right) = \exp\left(-\frac{r^2}{2c}\right).$

This is the so-called Herbst argument. Of course, I have glossed over the most nontrivial part — namely, showing that we can bound the relative entropy ${D(P^{(t)} \| P)}$ by a quadratic function of ${t}$ . Gabor’s talk contained numerous examples of how one could get such bounds in the special case when ${X}$ is an ${n}$ -tuple of independent random variables taking values in some base space ${{\mathcal X}}$ , ${X = (X_1,\ldots,X_n) \in {\mathcal X}^n}$ , and the function ${f}$ is not “too sensitive” to local modifications of its arguments. This takes us back to probability on metric spaces.

Here is one classic example. Suppose that our function ${f}$ has the bounded difference property, i.e., there exist some constants ${c_1,\ldots,c_n \ge 0}$ , such that changing the ${i}$ th argument of ${f}$ while keeping others constant will change the value of ${f}$ by at most ${c_i}$ :

$\displaystyle \sup_{x_1,\ldots,x_n}\sup_{x'_i}\left| f(x_1,\ldots,x_i,\ldots,x_n) - f(x_1,\ldots,x'_i,\ldots,x_n) \right| \le c_i.$

We can express this more succinctly as a Lipschitz property of ${f}$ if we define the weighted Hamming metric

$\displaystyle d(x,y) = \sum^n_{i=1}c_i 1_{\{x_i \neq y_i\}} \ \ \ \ \ (6)$

(we can assume without loss of generality that all the ${c_i}$ ‘s are strictly positive, because we can simply ignore those coordinates of ${x}$ that do not affect the value of ${f}$ ). Then the bounded difference property is equivalent to saying that ${f}$ is ${1}$ -Lipschitz with respect to this weighted Hamming metric. Moreover, it is possible to show that any product probability measure ${P = P_1 \otimes \ldots \otimes P_n}$ on the product space ${{\mathsf X} = {\mathcal X}^n}$ satisfies the transportation inequality

$\displaystyle W_1(Q,P) \le \sqrt{2c\, D(Q \| P)},$

where the Wasserstein distance is computed with respect to the weighted Hamming metric (6), and ${c = \frac{1}{4}\sum^n_{i=1}c^2_i}$ . By the Bobkov–Götze result quoted above, this is equivalent to the concentration bound

$\displaystyle {\mathbb P}\left( f(X) \ge {\mathbb E}[f(X)] + r \right) \le \exp\left( - \frac{r^2}{2c}\right) = \exp\left( - \frac{2r^2}{\sum^n_{i=1}c^2_i}\right).$

This is the well-known McDiarmid’s inequality, which was originally derived using martingale techniques, but here we have arrived at it through a completely different route that took us back to where we started: concentration of Lipschitz functions around their means and/or medians, which (as we saw) is the same thing as the original, “stochastic-geometric” view of the concentration of measure phenomenon.

http://infostructuralist.wordpress.com/?p=707

Extensions

It’s for a good cause!

mraginsky May 3, 2013

Endorse the petition to honor Claude Elwood Shannon with a United States Postal Service stamp on the 100th anniversary of his birth.

Show full content

Endorse the petition to honor Claude Elwood Shannon with a United States Postal Service stamp on the 100th anniversary of his birth.

http://infostructuralist.wordpress.com/?p=704

Extensions

https://infostructuralist.wordpress.com/feed

Posts