Jaan Lı — GeistHaus

Jaan Lı Feb 19, 2023 Updated Feb 19, 2023

One fact is all it takes to create change. I started this nonprofit foundation and assembled a team of co-founders with superpowers to scale our AI for health and education globally. I gave a keynote at an NIH conference about our first product.

One Fact Foundation was originally published by Jaan Lı at Jaan Lı on February 19, 2023.

https://jaan.io/one-fact-foundation

Virtual Thesis Defense - Giving and Recording a Stressful Zoom Presentation

Jaan Lı May 21, 2020 Updated May 21, 2020

I defended my PhD on Zoom (YouTube). It was stressful, but I hope these notes save you time for presenting and enable you to record higher-quality academic presentations. Please let me know if you have any suggestions I missed!

General Presentation

Many tips from meatspace apply. I recommend Matt Might’s tri-tips:

For example, using Keynote and a bluetooth clicker makes a difference. In addition, I suggest standing to give the presentation and wearing shoes, and anything else that would help make it feel more like a standard academic presentation instead of mid-pandemic grilling hour.

The ad-hoc setup I used for my thesis defense.

Camera, Lighting, Audio, Makeup

I am not an expert and guessed at all of this the day of the defense. It seems like an easier solution is to use a smartphone for video with the Filmic Pro app, as suggested in Conan O’Brien’s filming-from-home setup.

Camera: I borrowed a friend’s Fujifilm X-T1. On my MacBook, it was only usable if the Cam Link capture device was plugged into the right-hand-side USB 3.0 port.

Camera placement: I put the camera just above the external display the Keynote slide show was playing on. In hindsight, a laptop would have been sufficient, along with an external display to force Keynote to put the presenter view on the laptop. In the video, my eyes visibly look to the left or right on the large external display, and a laptop would have acted more like a teleprompter to keep my eyes close to the camera.

Where to stand: by looking at the video feed, I marked the floor with tape at the line where I would go out of the frame if I got too close to the laptop.

Makeup: I am not an expert, but I was lucky to be powdered to hide some of the stress-induced hyperhidrosis in my oily T-zone.

Lighting: I used a seasonal affective disorder light therapy lamp (ironically called the Happy Light) as a makeshift high-CRI ring light.

Audio: I really wanted to wrangle a microphone cable on video, and thought it would be funny to use a hand-held microphone during the presentation (not to mention higher-quality audio). This personal just-for-me Easter egg was just silly enough to make my day. I used a Shure SM58 plugged into an external audio device, and recorded it with QuickTime.

Recording and Editing

Setup for recording: In Zoom, Conan O’Brien’s setup covers this. Record to separate audio channels and record in high definition. As backup audio recordings I used my laptop microphone and smartphone. I used Elgato Game Capture to record the video, but in hindsight would have used Adobe Premiere to make editing easier. Keynote has a record feature for recording the slide show for higher resolution.

Recording: The day-of recording checklist was specific and non-deterministic: turn on smartphone audio recording, turn on QuickTime audio recording, turn on Game Capture HD recording for the video feed, then turn on Zoom recording, then screen share on Zoom (screen sharing to an external display makes the menu bar inaccessible), and only then turn on Keynote recording. If these steps were done out of order, something (typically Zoom) would break. After making this checklist, I had 5 minutes left before my defense in which to rehearse. Please don’t put yourself in this position like I did.

Editing: it takes a while to edit both the audio and video. I used Ableton to de-ess, compress, and add reverb to the audio, and to stitch together the Zoom recording with the recording from my microphone. In Adobe Premiere, I synchronized the audio to the video, applied minor color correction, and exported it in high definition. Tools such as ffmpeg, Homebrew, and ImageMagick were very useful in converting between audio and video formats, or fixing timing issues, or making cute GIFs. In hindsight, it is worth paying someone to do this as it is a lot of work, and YouTube is rife with ‘I paid someone $5 to edit my video!!!!’ reviews of Fiverr as an economical option.

Slides

Most likely, people will use their personal laptops to view your defense, so the thumbnail of the main speaker will be tiny and the slide show itself will be huge. Small details on slides will be noticeable. Use Keynote to make the presentation (see mine here). Use Illustrator or Inkscape to make graphics and icons. I find free icons from a search engine, then edit the vectorized graphics and export to PDF; my Illustrator file is here. For math in Keynote, I recommend LaTeXiT that comes with MacTeX/MiKTeX, and putting white squares over new lines of math or terms in equations with the ‘dissolve in’ animation. Adding dissolving rectangles helped me go slower and more pedagogically with already-familiar material.

Practice

I recommend giving the talk several times beforehand, to a live audience (physically-distanced if at all possible). I feel very lucky that I was able to give parts of my presentation in stressful high-stakes environments ahead of time, as preparation. As an example of how jarring it is to give a virtual presentation: on Zoom, it was impossible to know who was laughing because everyone was muted and had their video off. This meant when I got to a joke in my presentation, I felt extremely uncomfortable because I could not see reactions, so did my best to deliver it deadpan. However, because I had given parts of the presentation in front of a live audience, I was more confident that some parts might still be funny, and this helped me keep going rather than freeze up.

If you are practicing on Zoom, you might ask a few audience members to leave their video feed on during your presentation. I did not think of this ahead of time, and had to consciously remind myself that I was talking to humans rather than a camera. During the question and answer period, it felt a lot nicer to be talking to human video feeds than to a Fujifilm X-T1 staring me down.

After giving the talk several times, do a full run-through: dress up, get the lighting right, set up all the recordings, answer questions from the audience after a full presentation, and practice editing in Premiere or via Fiverr. This seems like a lot of prep, but doing all of it for the first time on the day adds a massive cognitive load on top of a stressful defense. The run-through is worth it, to learn to trust or mistrust failure modes appropriately and figure out what to do in worst-case scenarios.

Day of the Defense

Assume Murphy’s Law: everything that can go wrong, will go wrong. This held true in my case. Keep several checklists on hand with what you need to do, keyboard shortcuts you might need (such as muting yourself or the audience), water and snacks, etc.

For example, Zoom became non-deterministic: one example of a bug I discovered 30 minutes before the presentation was that if screen sharing was on, the Mac menu bar became inaccessible, and I had to learn and write down all the keyboard shortcuts I needed for all the apps during the presentation. (One would need to mouse to the top of the primary display, only to have the menu bar appear in the second display.)

Another Zoom bug: broken screen-sharing after making Keynote full-screen, namely if an external display is plugged in and is being used for Keynote presenter mode, Zoom will stream the presenter mode instead of the actual presentation.

Silver Linings

I recommend bcc’ing many people to invite them to your defense, because this is now possible and low-stakes. I was scared to do this, but went through my short message service history, WhatsApp, Signal, Facebook, email, and (gratitude) journals to remember who to invite, and was surprised at the response.

Even people I had not interacted with in years were supportive, happy to have some good news during quarantine, and ready to join a surreal Zoom call on short notice. It made me feel more supported during an otherwise stressful time.

Another positive reappraisal I found helpful was that all this work and nitpicking about camera, audio, and presentation quality was less for me and more for all the people who got me to this point (for example, my grandmother in Estonia, who would not be able to see the presentation live, but I knew she would love to see the video).

Similarly, writing these notes helps being done with grad school feel more concrete psychologically. After all, I was in an empty room, talking to a camera, and seemingly got my PhD—maybe the simulation has improved after all.

Please send me a link to your talk or defense! I would love to see it. In addition, feel free to email me with any tips or corrections to these notes.

Thanks to Will Whitney for camera-lending and everyone else in my thesis acknowledgments and otherwise who got me to this point.

Virtual Thesis Defense - Giving and Recording a Stressful Zoom Presentation was originally published by Jaan Lı at Jaan Lı on May 21, 2020.

https://jaan.io/virtual-thesis-defense-recording-zoom-presentation

How does physics connect to machine learning?

Jaan Lı Aug 11, 2017 Updated Aug 11, 2017

Mandarin translation available - 用普通话阅读这篇文章: WeChat

I struggled to learn machine learning. I was used to variational tricks, MCMC samplers, and discreet Taylor expansions from years of physics training. Now the concepts were mixed up. The intuitive models of physical systems were replaced by abstract models of ‘data’ and amechanical patterns of cause and effect.

I had to fit these fields together. Physics and machine learning are intricately connected, but it is taking me years to make the overlaps precise. This process requires representing the new with the familiar, mapping jargon from one field to another.

A simple model of magnets—the Ising model—will help illustrate the rich connection between these fields. We first analyze this model with physics intuition. Then we derive the variational principle in physics and show that it recovers the same solution.

We then discover how that very same variational principle in physics opens a window into machine learning. We identify Boltzmann distributions as exponential families to make the mapping transparent, and show how approximate posterior inference is scaled to massive data thanks to the variational principle.

If you have a physics background, I hope you will have a better sense of machine learning and be able to read papers in the field. If you are a machine learner, I hope you will have the context to read a statistical physics paper about mean-field theory and the Ising model.

If this article is confusing, falls short of these goals, or could be improved in any way please email me, @ me, or submit a pull request.

The Ising model, a physics perspective

Consider a lattice of spins that point up or down:

[Figure: a lattice of spins, each pointing up or down.]

What features might make this a convincing model of magnetism?

Think about playing with magnets—if you put them close together, they pull each other closer. They repulse each other when their poles oppose, and if they’re far apart, they don’t attract.

This means neighboring spins should affect each other in our model: if the spins around $s_i$ point upward, it should also want to point upward.

Let’s refer to the spin at location $i$ as $s_i$. A spin can be in one of two states: a spin can point up ($s_i=+1$) or down ($s_i=-1$).

We can capture our intuition about spins being attracted to each other (they want to point in the same direction) or repulsed (they want to point in opposite directions) by introducing a parameter $J$. This interaction parameter captures the interaction strength between spin $i$ and spin $j$.

If two neighboring spins point in the same direction, we’ll have them contribute a term $-J$ to the total energy; if they point in opposing directions, they will contribute $J$.

This lets us write the energy function, or Hamiltonian, of the system:

\[E(s_1, s_2,...,s_N) = -\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^NJ_{ij} s_i s_j\]

Here $J_{ij} = J$ if spins $i$ and $j$ are neighbors, and $J_{ij} = 0$ otherwise. The factor of $\frac{1}{2}$ in front is to account for double counting from the sums over both $i$ and $j$. Note that the system has finitely many spins ($N$ spins).

A spin configuration or state of the system is a specific setting of values for all spins. The set $\{s_1=+1, s_2=+1, s_3=-1, ..., s_N=+1\}$ is an example of a configuration.

The second law of thermodynamics says that at a fixed temperature and entropy, a system will seek configurations that minimize its energy. This lets us reason about interactions.

If the interaction strength $J$ is zero, the spins do not interact and the system has the same energy, zero, for all configurations (i.e. the energy is trivially minimized). But if the interaction strength $J$ is positive, the spins will tend to align to minimize the energy of the system $E(s_1, s_2,...,s_N)$. This corresponds to minimization because of the minus sign convention in front of the sum in the energy function.
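To make the sign convention concrete, here is a minimal sketch of the energy function. It assumes a two-dimensional lattice with periodic boundaries (an assumption, not something the text requires), and the name `ising_energy` is hypothetical. Each neighboring pair is counted once, which is equivalent to the double sum with the factor of $\frac{1}{2}$.

```python
def ising_energy(spins, J=1.0):
    """Energy of a 2D periodic lattice of +1/-1 spins, no magnetic field.

    Each nearest-neighbor bond is counted once (right and down neighbors),
    which matches the double sum over i and j with the factor of 1/2.
    """
    n_rows, n_cols = len(spins), len(spins[0])
    energy = 0.0
    for r in range(n_rows):
        for c in range(n_cols):
            right = spins[r][(c + 1) % n_cols]
            down = spins[(r + 1) % n_rows][c]
            energy -= J * spins[r][c] * (right + down)
    return energy


# Two example configurations on a 4x4 lattice (32 bonds in total).
aligned = [[+1] * 4 for _ in range(4)]
checkerboard = [[+1 if (r + c) % 2 == 0 else -1 for c in range(4)] for r in range(4)]

print(ising_energy(aligned))       # every bond contributes -J
print(ising_energy(checkerboard))  # every bond contributes +J
```

With $J > 0$ the aligned configuration has the lowest energy and the checkerboard the highest, and with $J = 0$ every configuration has zero energy.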

Now let’s introduce a magnetic field $H$. Imagine the lattice of spins immersed in a magnetic field, perhaps the ambient field from the earth’s crust. The magnetic field affects every spin independently, and each spin will try to align with the field. We can include the magnetic field in the energy of the system by summing the independent contributions for each spin:

\[E(s_1, s_2,...,s_N) = -\frac{1}{2}\sum_{i, j} J_{ij}s_i s_j - H\sum_i s_i\]

We can reason about the magnetic field strength $H$ by imagining what happens if it is large or small (strong or weak). If $H$ is large and the interactions between spins are weak, the magnetic field term will dominate and the spins will align with the magnetic field to minimize energy. But if the magnetic field is small, it is more difficult to reason about.

Now that we have defined the Ising model and its characteristics, let’s think about our goals. What questions can we answer about this Ising model? For example, if we observe the system, what state will it be in—what are the most likely spin configurations? What is the average magnetization?

The Boltzmann distribution

Can we make our goals more precise and make math from words? To do this, we need to define a distribution over spin configurations. It is straightforward to derive the probability of finding the system in an equilibrium state 1:

\[p(s_1, s_2,...,s_N) = \frac{e^{-\beta E(s_1, s_2,...,s_N)}}{Z}\]

This is the Boltzmann distribution. The numerator is called the Boltzmann factor for a particular configuration. This factor gives high or low weight to a specific state of the system according to the energy for that state.

We query the Boltzmann distribution at a specific configuration of spins to get the probability of finding the system in this state.

For example, say the first spins in our configuration happen to be up, up, down, etc. We plug this in and get $p(s_1=+1, s_2=+1, s_3=-1,...,s_N=+1)=0.7321$. This means this state was pretty likely.

This distribution behaves intuitively: low energy states are more probable than configurations with high energy. For example, if $J=+1$, the spins will align, and the state where all spins point in the same direction is most probable. Why? Because it leads to the most negative energy function, which corresponds to the Boltzmann factor with the largest weight.

The parameter $\beta$ is proportional to the inverse temperature, $\beta = \frac{1}{k_BT}$ and is used for notational convenience. (Specifically, the constant $k_B$ converts temperature to units of energy, so that the exponent $\beta E$ is dimensionless.) Temperature affects the model by controlling how important the interactions are. If $T\rightarrow \infty$ we are at a high temperature, and the inverse temperature is small with $\beta \ll 1$, so the interaction strength $J$ is not important and has little effect. But at low temperatures, the inverse temperature is large, so interactions have a large effect on the system’s behavior.
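The effect of temperature is easy to see in the smallest possible system. The sketch below assumes just two coupled spins with energy $E = -J s_1 s_2$ and no field; the function name `boltzmann_two_spins` is hypothetical, and $\beta$ is given directly rather than through $k_B T$.

```python
import math
from itertools import product


def boltzmann_two_spins(beta, J=1.0):
    """Probabilities of the four states of two coupled spins, E = -J * s1 * s2."""
    # Boltzmann factor exp(-beta * E) = exp(beta * J * s1 * s2) for each state.
    weights = {s: math.exp(beta * J * s[0] * s[1]) for s in product((+1, -1), repeat=2)}
    Z = sum(weights.values())  # partition function: only four terms here
    return {s: w / Z for s, w in weights.items()}


for beta in (0.01, 1.0, 5.0):
    probs = boltzmann_two_spins(beta)
    p_aligned = probs[(1, 1)] + probs[(-1, -1)]
    print(f"beta = {beta:>4}: P(aligned) = {p_aligned:.4f}")
```

At small $\beta$ (high temperature) the aligned and anti-aligned states are nearly equally likely, while at large $\beta$ (low temperature) almost all of the probability sits on the two aligned states.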

The partition function

The denominator $Z$ is of utmost importance. It ensures that the distribution sums to $1$ and is thus a valid probability distribution. We need this normalization to calculate properties of the system. Calculating mean values and other moments can only be done with a probability mass function. The name of $Z$ is “partition function” or “normalizing constant”. It is the sum of each state’s Boltzmann factor:

\[Z = \sum_{s_1=\pm1}\sum_{s_2=\pm1}...\sum_{s_N=\pm1}e^{-\beta E(s_1, s_2, ..., s_N)}\]

I explicitly wrote out the sum to illustrate why we can’t evaluate this distribution: we need to sum over all possible configurations. Each spin has two states, and there are $N$ spins. This leads to $2^N$ terms in the sum. For a small system with a few hundred spins, this is already greater than the number of atoms in the universe ($2^{300} \approx 10^{90}$, versus roughly $10^{80}$ atoms), so we can never hope to calculate it 2.
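A brute-force sum makes the scaling tangible. The sketch below is an illustration under simplifying assumptions: it uses an open one-dimensional chain rather than the two-dimensional lattice, and the names `chain_energy` and `partition_function` are hypothetical.

```python
import math
from itertools import product


def chain_energy(spins, J=1.0, H=0.0):
    """Energy of an open 1D chain: nearest-neighbor coupling J and field H."""
    pair_term = sum(spins[i] * spins[i + 1] for i in range(len(spins) - 1))
    return -J * pair_term - H * sum(spins)


def partition_function(n_spins, beta, J=1.0, H=0.0):
    """Sum the Boltzmann factor over all 2**n_spins configurations."""
    return sum(
        math.exp(-beta * chain_energy(s, J, H))
        for s in product((+1, -1), repeat=n_spins)
    )


# Ten spins means 2**10 = 1024 terms, which is instant.
print(partition_function(10, beta=0.5))

# The number of terms doubles with every added spin.
for n in (10, 20, 100, 300):
    print(f"N = {n:>3}: about 10^{n * math.log10(2):.0f} terms")
```

Enumeration is fine for a handful of spins, and it is a useful check on approximations, but the term count rules it out for any system of realistic size.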

Using the Boltzmann distribution to calculate properties of the system

We arrived at a probability distribution describing which states of the system are likely, but we were stumped by the intractable partition function. Let’s temporarily assume we have infinite computation and can calculate the Boltzmann distribution’s partition function. What are some interesting things we can learn about the system from its Boltzmann distribution?

This distribution lets us calculate properties of the system as a whole by taking expectations (i.e. calculating observable quantities). For example, the magnetization $m$ is the average magnetization over all spins:

\[m = \frac{1}{N} \langle s_1 + s_2 + ... + s_N \rangle = \langle s_i \rangle\]

Why should we care about this magnetization? It tells us about the system as a whole, the macrostate, rather than a specific microstate. We lose specificity because we can’t say anything about the first spin $s_1$, but we learn about how it behaves across all possible states of the rest of the spins.

If the spins are aligned, the system is in an ordered state and the magnetization has a positive or negative sign. If the spins are anti-aligned, the system is disordered and the average magnetization is zero.

These are global phases of the system, and they depend on temperature. If the temperature $T$ goes to infinity, the inverse temperature $\beta$ goes to zero, and all states of the system are equally likely, as described by the Boltzmann distribution. But if the temperature is finite, then some states are more likely than others, and the system can transition between ordered and disordered phases. Such phase transitions and how they depend on the temperature are important for comparing how well this Ising model matches real-world materials 3.
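For a lattice small enough to enumerate, the magnetization can be computed exactly. The sketch below assumes a 3x3 periodic lattice and a small field $H = 0.1$ (both choices are assumptions for illustration; on a finite lattice with $H = 0$ exactly, $m$ vanishes by symmetry). The function names are hypothetical.

```python
import math
from itertools import product

L = 3  # 3x3 periodic lattice: 2**9 = 512 configurations


def energy(spins, J=1.0, H=0.0):
    """Ising energy for a flat tuple of L*L spins, each bond counted once."""
    e = 0.0
    for r in range(L):
        for c in range(L):
            s = spins[r * L + c]
            right = spins[r * L + (c + 1) % L]
            down = spins[((r + 1) % L) * L + c]
            e -= J * s * (right + down) + H * s
    return e


def magnetization(beta, J=1.0, H=0.1):
    """Exact m = (1/N) <sum_i s_i> under the Boltzmann distribution."""
    Z = 0.0
    total = 0.0
    for spins in product((+1, -1), repeat=L * L):
        weight = math.exp(-beta * energy(spins, J, H))
        Z += weight
        total += weight * sum(spins) / (L * L)
    return total / Z


for beta in (0.01, 0.2, 0.5, 2.0):
    print(f"beta = {beta:>4}: m = {magnetization(beta):.4f}")
```

At high temperature the magnetization is close to zero, and at low temperature the spins mostly align with the field.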

Let’s remember that we can’t evaluate the partition function $Z$. This situation seems hopeless for answering interesting questions like calculating the magnetization. But thankfully, we may be able to simplify the problem by considering each spin independently and figuring out an approximation…

Mean-field theory in physics

Because we cannot evaluate the intractable sum required to calculate the partition function, we turn to mean-field theory.

This is an approximation technique that can still let us answer questions about the system such as the average magnetization. We will study the dependence of the magnetization $m$ on temperature.

To demonstrate the technique, it is easiest to focus on a single spin:

[Figure: spin $s_1$ and its four nearest neighbors $s_2$, $s_3$, $s_4$, $s_5$ on the lattice, in a magnetic field $H$.]

The first spin of the Ising model in a magnetic field H. The magnetic field is shown with dashed lines. Its nearest neighbors provide an effective field through the interactions, denoted by lines connecting the spins.

The contribution of this single spin to the total energy of the system is simply the corresponding term in the energy:

\[E_{s_1} = -s_1\left(J\sum_{j=2}^{z+1} s_j + H\right)\]

The sum is over the $z$ nearest neighbors. For the two-dimensional lattice we are considering, $z = 4$. We can rewrite this energy for a single spin in terms of the fluctuations of a spin $s_j$ around its mean value $m = \langle s_j \rangle$. Replacing $s_j = m + (s_j - m)$ gives

\[E_{s_1}= -s_1(zJm + H) -J s_1 \sum_{j=2}^{z+1} (s_j - m)\]

The next step is crucial: we will ignore the fluctuations of neighboring spins around their mean value. In other words, we assume that the term $(s_j - m) \rightarrow 0$, so that each of the neighbors of $s_1$ is simply equal to its mean value, $s_j = m$.

When is this true?

When the fluctuations around the mean value are small, such as at low temperature ‘ordered’ phases. This assumption greatly simplifies the Hamiltonian for the spin:

\[E_{s_1}^{MF} = -s_1 (zJm + H).\]

This is the mean-field energy function for a single spin. It is equivalent to a non-interacting spin in an effective magnetic field, $H^{eff}=zJm + H$.

Why do we say this spin is non-interacting? The energy function for the spin only depends on its state, $s_1$, and does not depend on the state of any other spins. We have approximated the interaction effects by the average magnetic field induced by the neighboring spins; this is the mean field.

In this mean-field model, each spin feels the effects of the magnetic field applied to the entire system, $H$, as well as the ‘effective’ mean field from its neighboring spins $zJm$.

We clarify this interpretation by writing

\[H\leftarrow H + \Delta H\]

where $\Delta H = zJm$ is the average magnetic field (‘mean field’) of the neighbors of each spin.

By ignoring the fluctuations of each spin, we have reduced the complexity of the problem. Instead of $N$ interacting spins, we have $N$ independent spins in a uniform magnetic field $H$ with a small correction $\Delta H$ to account for the effects of interactions.

We write the energy function for the mean-field model as

\[E_{MF}(s_1, s_2,...,s_{N}) = -(H+\Delta H)\sum_{i=1}^Ns_i.\]

This shows that there are no interaction terms anymore (the term $s_is_j$ doesn’t occur in the energy).

In other words, we can treat each spin independently, then combine the results appropriately to model the entire system!

We have radically changed the nature of the problem.

Instead of computing the partition function $Z$ for the whole system, we can now compute it for a single spin.

This is straightforward and has an analytic solution 4:

\[Z_{s_1} = \sum_{s_1 = \pm 1} e^{\beta s_1(H+\Delta H)}\] \[\Rightarrow Z_{s_1} = 2\cosh{[\beta(H+\Delta H)]}\]

The partition function for the entire mean-field model with $N$ spins is then

\[Z_{MF} = (2\cosh{[\beta(H+\Delta H)]} )^N.\]
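The factorization can be checked numerically for a small number of spins. This is a minimal sketch with hypothetical function names, where `h_eff` stands for $H + \Delta H$.

```python
import math
from itertools import product


def z_mf_brute_force(n_spins, beta, h_eff):
    """Sum exp(-beta * E_MF) over all configurations, E_MF = -h_eff * sum_i s_i."""
    return sum(
        math.exp(beta * h_eff * sum(spins))
        for spins in product((+1, -1), repeat=n_spins)
    )


def z_mf_closed_form(n_spins, beta, h_eff):
    """The factorized result (2 * cosh(beta * h_eff)) ** N."""
    return (2.0 * math.cosh(beta * h_eff)) ** n_spins


beta, h_eff, n_spins = 0.7, 1.3, 8
print(z_mf_brute_force(n_spins, beta, h_eff))
print(z_mf_closed_form(n_spins, beta, h_eff))
```

The two numbers agree up to floating-point error, because the sum over independent spins factorizes into a product of identical single-spin sums.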

With the partition function in hand, we can get the Boltzmann distribution and answer questions about the system such as magnetization.

We get the magnetization by taking an expectation over the distribution for the spin. The last step is to require that for any spin $i$, its average magnetization should equal the magnetization of the system as a whole:

\[m = \sum_{s_i=\pm 1} p(s_i) s_i\] \[\Rightarrow m =\tanh{[\beta({H + \Delta H})]}\]
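Spelled out, the single-spin Boltzmann distribution for the mean-field energy $E^{MF} = -s_i(H + \Delta H)$ is $p(s_i) = e^{\beta s_i (H + \Delta H)}/Z_{s_1}$, so the two terms of the sum are:

\[m = \frac{(+1)\,e^{\beta(H+\Delta H)} + (-1)\,e^{-\beta(H+\Delta H)}}{2\cosh{[\beta(H+\Delta H)]}} = \frac{2\sinh{[\beta(H+\Delta H)]}}{2\cosh{[\beta(H+\Delta H)]}} = \tanh{[\beta(H+\Delta H)]}\]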

This gives us a clean equation for the magnetization,

\[m = \tanh{[\beta(H + zJm)]},\]

where we used that the mean-field parameter is $\Delta H = zJm$.

This is a formula for the magnetization $m$ as a function of temperature. It has no closed-form solution, but we can plot both sides of the equation and see where they intersect to find the implicit solutions.
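The intersections can also be found numerically. Here is a minimal sketch using fixed-point iteration, which is one of several ways to solve the equation; the function name `solve_magnetization`, the starting guess `m0`, and the choice of units with $k_B = 1$ are assumptions for illustration.

```python
import math


def solve_magnetization(beta, J=1.0, z=4, H=0.0, m0=0.5, tol=1e-12, max_iter=10_000):
    """Iterate m <- tanh(beta * (H + z * J * m)) until it stops changing."""
    m = m0
    for _ in range(max_iter):
        m_new = math.tanh(beta * (H + z * J * m))
        if abs(m_new - m) < tol:
            return m_new
        m = m_new
    return m


# High temperature (beta * z * J = 0.5): only the m = 0 solution exists.
print(solve_magnetization(beta=0.125))

# Low temperature (beta * z * J = 2): the sign of the starting guess picks the branch.
print(solve_magnetization(beta=0.5, m0=+0.5))
print(solve_magnetization(beta=0.5, m0=-0.5))
```

The iteration converges to a stable solution only. At low temperature the $m = 0$ solution still satisfies the equation, but any nonzero starting guess moves away from it.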

First let’s think about the case when there is no external field, $H = 0$.

For high temperatures, the equation only has one solution: $m = 0$. This aligns with our intuition—if we look at the energy of the system, the inverse temperature $\beta$ goes to zero and all states of the spins are equally likely. They average out to zero.

For low temperatures, we see three solutions: $m=0$ and $m= \pm \lvert m\rvert$. The additional $\pm$ solutions appear when the slope of the $\tanh$ function at the origin is greater than one:

\[\frac{d}{dm} \tanh{[\beta zJm]} |_{m=0} > 1\] \[\Rightarrow \beta zJ > 1\]

The “critical temperature” at which the phase transition occurs is when $\beta zJ = \frac{1}{k_B T} zJ = 1$, or when $k_B T_c = zJ$.
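A temperature sweep shows the transition. The sketch below assumes units where $k_B = 1$, with $z = 4$ and $J = 1$, so the predicted critical temperature is $T_c = 4$; the function name `mean_field_m` is hypothetical. The sweep avoids $T = T_c$ exactly, where the iteration converges very slowly.

```python
import math


def mean_field_m(T, J=1.0, z=4, n_iter=20_000):
    """Solve m = tanh(z * J * m / T) at H = 0 by fixed-point iteration (k_B = 1)."""
    m = 1.0  # start fully ordered so we land on the non-negative branch
    for _ in range(n_iter):
        m = math.tanh(z * J * m / T)
    return m


T_c = 4 * 1.0  # k_B * T_c = z * J with z = 4, J = 1
print(f"predicted T_c = {T_c}")
for T in (2.0, 3.0, 3.9, 4.1, 5.0, 6.0):
    print(f"T = {T:.1f}  m = {mean_field_m(T):.4f}")
```

Below $T_c$ the magnetization is nonzero and shrinks as $T$ approaches $T_c$; above $T_c$ it is zero.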

This gives us a testable prediction: we can take a magnetic material, and measure what temperature its phase transition occurs at.

Have we accomplished our goal?

We set out to understand the behavior of this model at various temperatures, in terms of global properties like the magnetization.

By considering a single spin and approximating the effects of other spins as an effective magnetic field, we were able to reduce the complexity of the problem. This allowed us to study phase transitions. However, our exposition felt a little hand-wavy, so let’s dive into a rigorous foundation to justify our intuitions.

Deriving the variational free energy principle: the Gibbs-Bogoliubov-Feynman inequality

Can we learn what tradeoffs we make when we make the assumption of ‘ignoring fluctuations’ of spins around their mean values? Specifically, how can we gauge the quality of results derived from our mean-field theory?

We can rederive the mean-field results in the previous section by directly attacking the problem of the intractable partition function. We can try to approximate this partition function with a simpler one.

Recall that the partition function $Z$ for the system is

\[Z = \sum_{s_1, s_2, ..., s_N}e^{-\beta E(s_1, s_2,...,s_N)}\]

where as before, the energy for the system is

\[E(s_1, s_2,...,s_N) = -\frac{1}{2}\sum_{i, j} J_{ij}s_i s_j - H\sum_i s_i.\]

The complexity of computing the partition function comes from the interaction term with $s_is_j$. We saw that without this term, we were able to reduce the problem to dealing with a system of independent spins.

To derive the variational principle, we will therefore assume an energy function of the form

\[E_{MF}(s_1, s_2,...,s_N) = -(H + \Delta H) \sum_{i=1}^N s_i\]

Previously we saw that the mean-field parameter is $\Delta H = zJm$ which we derived using our physics intuition.

Now we ask the question: is this the optimal effective magnetic field? We can think of $\Delta H$ as a parameter of the mean-field model that we can tune to get the best answers for the original system.

This is known as perturbation theory: we are perturbing the magnetic field of the system and trying to find the optimal perturbation that yields a good approximation to the original system.

What does a ‘good approximation’ entail? Our difficulties were in computing the partition function. We therefore want to approximate the partition function of the original system $Z$ with the partition function of our mean-field system $Z_{MF}$. Let’s hope that $Z_{MF}$ is easy to calculate and does not require a sum on the order of the number of atoms in the universe.

First let’s see if we can express the partition function of the original system $Z$ in terms of our approximation. We can measure how the energy of the mean-field system deviates from the reference system by computing the fluctuations in energy:

\[\Delta E(s_1, s_2,...,s_N) = E(s_1, s_2,...,s_N) - E_{MF}(s_1, s_2,...,s_N)\] \[\Rightarrow E = E_{MF} + \Delta E\]

This lets us reëxpress the original partition function as:

\[Z = \sum_{s_1, s_2,...,s_N} \exp{[-\beta(E_{MF} + \Delta E)]}\] \[\Rightarrow Z =Z_{MF} \sum_{s_1, s_2,...,s_N}\frac{\exp{(-\beta E_{MF})} \exp{(-\beta\Delta E)}}{Z_{MF}}\]

For the next step, we need the definition of an expectation of a function $A$ with respect to the mean-field Boltzmann distribution:

\[\langle A \rangle_{MF} =\sum_{s_1, s_2,...,s_N} \frac{A e^{-\beta E_{MF}}}{Z_{MF}}\]

This means we can write the partition function of the system in terms of the mean-field partition function as:

\[Z=Z_{MF}\langle \exp{(-\beta \Delta E)}\rangle_{MF}\]

This is an exact factorization of the partition function of the original system. It is the mean-field partition function weighted by the expected Boltzmann factor for energy fluctuations away from the reference system.

However, integrating this complicated exponential function is difficult, even with respect to the mean-field system. We’ll simplify it with a classic physics trick—by pulling a Taylor expansion.

Let’s assume that the fluctuations of the energy are small compared to the thermal energy; $\beta \Delta E \ll 1$. Then we can Taylor expand the exponential:

\[\langle \exp{(-\beta \Delta E)}\rangle_{MF}~\approx~\langle 1 - \beta \Delta E + ... \rangle_{MF}\] \[=~1 - \beta \langle \Delta E\rangle_{MF}+...\] \[=\exp{(-\beta \langle \Delta E\rangle_{MF})} + ...\]

We have neglected terms of second order in the fluctuations $\Delta E$. This gives us our first-order perturbation theory result for the partition function of the original system:

\[Z \approx Z_{MF}\exp{(-\beta \langle \Delta E\rangle_{MF})}\] \[\Rightarrow Z \approx Z_{MF}\exp{(-\beta \langle E - E_{MF}\rangle_{MF})}\]

How good is the approximation? We need a simple inequality 5: $e^x \geq x + 1$.

Let’s apply this to the expectation in the exact factorization of the partition function, taking $f = -\beta \Delta E$:

\[\langle e^f\rangle = e^{\langle f\rangle} \langle e^{(f - \langle f \rangle)} \rangle\] \[\geq e^{ \langle f \rangle} \langle 1 + f - \langle f \rangle\rangle = e^{\langle f \rangle}\]

Now we can get a lower bound on the partition function:

\[Z = Z_{MF}\langle \exp{(-\beta \Delta E)}\rangle_{MF}\] \[\Rightarrow Z \geq Z_{MF} \exp{[-\beta \langle E - E_{MF}\rangle_{MF}]}\]

This inequality is the Gibbs-Bogoliubov-Feynman inequality. It tells us that with our mean-field approximation, we get a lower bound on the original partition function.
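Both the exact factorization and the inequality can be checked by enumeration on a tiny lattice. The sketch below assumes a 3x3 periodic lattice with $J = 1$, $H = 0.2$, $\beta = 0.3$ (all arbitrary illustrative values) and tries a few trial values of $\Delta H$, written `dH`; the function names are hypothetical.

```python
import math
from itertools import product

L, J, H, beta = 3, 1.0, 0.2, 0.3
N = L * L


def energy(spins):
    """Full Ising energy on a periodic L x L lattice, each bond counted once."""
    e = 0.0
    for r in range(L):
        for c in range(L):
            s = spins[r * L + c]
            e -= J * s * (spins[r * L + (c + 1) % L] + spins[((r + 1) % L) * L + c])
            e -= H * s
    return e


def energy_mf(spins, dH):
    """Mean-field energy: independent spins in the field H + dH."""
    return -(H + dH) * sum(spins)


def check(dH):
    """Return (Z, Z_MF * <exp(-beta dE)>_MF, lower bound) for a trial dH."""
    states = list(product((+1, -1), repeat=N))
    Z = sum(math.exp(-beta * energy(s)) for s in states)
    Z_mf = sum(math.exp(-beta * energy_mf(s, dH)) for s in states)
    p_mf = [math.exp(-beta * energy_mf(s, dH)) / Z_mf for s in states]
    dE = [energy(s) - energy_mf(s, dH) for s in states]
    factorized = Z_mf * sum(p * math.exp(-beta * d) for p, d in zip(p_mf, dE))
    bound = Z_mf * math.exp(-beta * sum(p * d for p, d in zip(p_mf, dE)))
    return Z, factorized, bound


for dH in (0.0, 0.5, 1.0, 2.0):
    Z, factorized, bound = check(dH)
    print(f"dH = {dH:.1f}: Z = {Z:.3f}, factorized = {factorized:.3f}, bound = {bound:.3f}")
```

The factorized expression reproduces $Z$ for every trial field, while the bound stays below $Z$ and changes with $\Delta H$, which is what makes $\Delta H$ worth optimizing.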

Variational treatment of the Ising model using the Gibbs-Bogoliubov-Feynman inequality

Let’s apply this theory: do we recover the same results for magnetization in the Ising model?

In the mean-field Ising model, we treat each spin independently, so the energy function of the system decomposes into independent parts:

\[E_{MF}(s_1, s_2,...,s_N) = -(H + \Delta H) \sum_{i=1}^N s_i\]

Here $\Delta H$ is the effective magnetic field strength. It is a parameter we can tune to maximize the lower bound on the partition function.

Let’s plug this into the lower bound on the partition function from the Gibbs-Bogoliubov-Feynman inequality. Then we take the derivative to maximize the lower bound:

\[0 = \frac{\partial}{\partial{\Delta H}} Z_{MF} \exp{[-\beta \langle E - E_{MF}\rangle_{MF}]}\]

First, we need to evaluate the expectation:

\[\langle E - E_{MF}\rangle_{MF} = -N(\frac{1}{2} Jz\langle s_1\rangle^2_{MF} - \Delta H \langle s_1 \rangle_{MF}),\]

where we used the mean-field assumption that the spins are independent, hence $\langle s_i s_j\rangle_{MF} = \langle s_i\rangle_{MF} \langle s_j\rangle_{MF}$.

We also assumed that for a large enough system, spins at the edges of the model (boundary conditions) can be ignored, so all spins have the same average magnetization: $\langle s_i\rangle_{MF}\langle s_j\rangle_{MF} = \langle s_1\rangle_{MF}^2$.

Plugging this in to the lower bound on the partition function and differentiating gives

\[0 = \tanh[\beta(H + \Delta H)] - \langle s_1\rangle_{MF} + Jz\langle s_1\rangle_{MF} \frac{\partial}{\partial \Delta H} \langle s_1\rangle_{MF} - \Delta H \frac{\partial}{\partial \Delta H} \langle s_1\rangle_{MF}\] \[\Rightarrow \Delta H = Jz \langle s_1\rangle_{MF}.\]

We used that $m = \langle s_1\rangle_{MF} = \tanh{[\beta({H + \Delta H})]}$ from before.

This confirms our earlier reasoning, that the optimal mean-field parameter is $\Delta H = Jzm$. There were three steps to this process. We started by defining the model we cared about, we wrote down a mean-field approximation to it, and we maximized a lower bound on the partition function.
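The same conclusion can be reached numerically by maximizing the bound directly. The sketch below works with the logarithm of the lower bound per spin, using $Z_{MF} = (2\cosh[\beta(H+\Delta H)])^N$ and the expectation $\langle E - E_{MF}\rangle_{MF}$ from above. The parameter values, the grid search, and the function name `log_bound_per_spin` are assumptions for illustration; a small positive field is used so that the maximum is unique.

```python
import math

beta, J, z, H = 0.5, 1.0, 4, 0.1


def log_bound_per_spin(dH):
    """(1/N) * log of Z_MF * exp(-beta * <E - E_MF>_MF) for a trial field dH."""
    m = math.tanh(beta * (H + dH))
    log_z_mf = math.log(2.0 * math.cosh(beta * (H + dH)))
    return log_z_mf + beta * (0.5 * J * z * m**2 - dH * m)


# Coarse grid search over the variational parameter dH in [-6, 6].
grid = [i / 1000.0 for i in range(-6000, 6001)]
best_dH = max(grid, key=log_bound_per_spin)

# Self-consistent mean-field solution m = tanh(beta * (H + z * J * m)).
m = 0.5
for _ in range(1000):
    m = math.tanh(beta * (H + z * J * m))

print(f"dH maximizing the bound: {best_dH:.3f}")
print(f"J * z * m              : {J * z * m:.3f}")
```

Up to the grid resolution, the field that maximizes the bound matches $Jzm$ from the self-consistency equation.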

The machine learning perspective on the Ising model

Now let’s frame what we just did in the language of machine learning. More specifically, let’s think in terms of probabilistic modeling.

We need some definitions to see how the variational principle is equivalent to variational inference in machine learning.

The Ising model is an undirected graphical model or Markov random field. We can represent the conditional dependencies of the model using a graph; the nodes in the graph are random variables. These random variables are the spins of the Ising model, so two nodes are connected by an edge if they interact. This lets us encode the joint distribution of the random variables in the following diagram:

A representation of the Ising model as an undirected graphical model. The nodes are random variables (spins) and edges denote conditional dependencies between their distributions.

The Boltzmann distribution is a parameterization of the joint distribution of this graphical model. This figure looks very similar to the physics spin-based representation—the spins are random variables. We can also write the joint distribution of the nodes in exponential family form. Exponential family distributions let us reason about a broad class of models and deserve a header.

Exponential families

A way to parameterize probability distributions like the Ising model is with exponential families. These are families of distributions that support a representation in this specific, convenient mathematical form:

\[p(x ; \eta) = h(x)e^{\eta^\top t(x) - a(\eta)}\]

Here $\eta$ is called the natural parameter, $h(x)$ is the base measure, $t(x)$ the sufficient statistic and $a(\eta)$ is the log normalizer, or log partition function. I was confused about exponential families for a long time and found a concrete derivation helpful.

For example, we are used to seeing the Bernoulli distribution in the following form 6:

\[p(x ; \pi) = \pi^x(1-\pi)^{(1-x)}\]

We can rewrite this in exponential family form:

\[p(x; \eta) = \exp{\{x\log \pi + (1-x)\log{(1-\pi)}\}}\] \[\Rightarrow p(x; \eta)=\exp{\{x\log \frac{\pi}{1-\pi} + \log{(1-\pi)}\}}\]

Comparing to the above formula for exponential families reveals the natural parameter, sufficient statistic, log normalizer, and base measure for the Bernoulli, given by $\eta = \log{\frac{\pi}{1-\pi}}$, $t(x) = x$, $a(\eta) = -\log{(1-\pi)} = \log{(1+e^\eta)}$, and $h(x) = 1$ respectively.

More connections to physics: the log normalizer is the log of the partition function. This is made clear in the exponential family form of the Bernoulli: $\log Z = \log \sum_{x\in\{0,1\}} e^{\eta x} = \log{(1+e^\eta)}$. We can now identify the parameter $\eta$ as analogous to temperature, with $x$ as a spin. We’ve identified the Ising model’s exponential family form!
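A quick way to convince yourself of the rewrite is to evaluate both forms numerically. This is a minimal sketch; the function names and the value $\pi = 0.3$ are my own choices for illustration.

```python
import math

def bernoulli_standard(x, pi):
    """Bernoulli probability in the familiar form pi^x (1 - pi)^(1 - x)."""
    return pi**x * (1 - pi) ** (1 - x)

def bernoulli_expfam(x, eta):
    """Exponential family form with h(x) = 1, t(x) = x, a(eta) = log(1 + e^eta)."""
    log_normalizer = math.log1p(math.exp(eta))
    return math.exp(eta * x - log_normalizer)

pi = 0.3                          # arbitrary illustrative value
eta = math.log(pi / (1 - pi))     # natural parameter (the log-odds)

for x in (0, 1):
    print(x, bernoulli_standard(x, pi), bernoulli_expfam(x, eta))
```

Both columns should match for $x = 0$ and $x = 1$, up to floating-point error.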

The exponential family form of the Ising model

Let’s connect this to the energy function of the Ising model by writing its Boltzmann distribution in exponential family form:

\[p(s_1, s_2,...,s_N; \beta, J, H) = \frac{e^{-\beta E(s_1, ..., s_N)}}{Z}\] \[p(s_1, s_2,...,s_N; \beta, J, H) = \exp{\{\sum_{(i, j)\in E}\frac{1}{2}\beta Js_is_j + \sum_{i \in V}\beta Hs_i - \log{Z}\}}\] \[p(s_1, s_2,...,s_N; \theta)=\exp{\{ \sum_{(i, j)\in E} \theta_{ij}s_is_j + \sum_{i \in V} \theta_i s_i - a(\theta)\}}\]

We have introduced some new notation common to graphical models: we have specified a joint distribution over a collection of random variables $\{s_1, ..., s_N\}$ that live on the graph over vertices $V$, joined by edges in the set $E$.

This is the exponential family form of the Ising model, a probability model with model parameters $\theta$. To equate it to the form we saw earlier, set $\theta_{ij} = \frac{1}{2}\beta J$ if $i$ and $j$ share an edge (i.e. they are neighbors), and set $\theta_i = \beta H$.

For the Ising model, we can see that there are two sets of model parameters. The spin-spin interaction parameter multiplied by the inverse temperature $\beta J$ controls the effects of each edge in the graph. The inverse temperature multiplied by the magnetic field $\beta H$ affects each spin independently. We can also say that the inverse temperature $\beta$ is a global model parameter. For a fixed interaction and magnetic field, we can vary the temperature to index a specific model.

This is a subtle but important point. Our joint distribution over the set of random variables (the $N$ spins) is indexed by the set of model parameters. By varying the inverse temperature parameter $\beta$, we are actually selecting a specific model (the Ising model at that temperature). Ditto for a specific choice of the spin-spin interaction parameter $J$.

What questions can we ask about the model?

Computing the magnetization $m = \frac{1}{N}\langle s_1 + ... + s_N \rangle = \langle s_i \rangle$ means calculating the expectation $\mathbb{E}_{p(s_i)}[s_i]$. In probability language, this means calculating the marginal expectation of a node $i$.

But calculating the marginal distribution is intractable for reasons we already discussed: it requires marginalizing over all other nodes $j \neq i$:

\[p(s_i) = \sum_{s_1=\pm 1} ... \sum_{s_{i-1}=\pm 1}\sum_{s_{i+1}=\pm 1}...\sum_{s_N=\pm1} p(s_1,...,s_{i-1}, s_i, s_{i+1}, ..., s_N)\]

The situation is hopeless: not only do we need to calculate the normalizing constant for the joint distribution of $N$ nodes, which has $2^N$ terms, but then we need to marginalize over $N-1$ variables (another $2^{N-1}$ terms).

This is identical to what we saw in the partition function, when thinking about this model from a physics perspective.
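To make the cost concrete, here is a brute-force sketch that enumerates every configuration of a small ring of spins to get the partition function and the exact expectation $\langle s_1 \rangle$. The ring geometry and the parameter values are assumptions made for illustration; the loop visits $2^N$ configurations, which is why this only works for tiny systems.

```python
import itertools
import math

def energy(spins, J, H):
    """Energy of spins on a ring (each spin has two neighbors, needs n >= 3).

    Summing each bond once equals half the sum over every spin's neighbors.
    """
    n = len(spins)
    bonds = sum(spins[i] * spins[(i + 1) % n] for i in range(n))
    return -J * bonds - H * sum(spins)

def exact_magnetization(n, beta=1.0, J=1.0, H=0.5):
    """Return (Z, <s_1>) by summing over all 2**n spin configurations."""
    Z = 0.0
    weighted_spin = 0.0
    for spins in itertools.product((-1, 1), repeat=n):
        weight = math.exp(-beta * energy(spins, J, H))
        Z += weight
        weighted_spin += weight * spins[0]
    return Z, weighted_spin / Z

for n in (4, 8, 12):
    Z, m = exact_magnetization(n)
    print(n, "spins:", 2**n, "terms, magnetization", round(m, 4))
```

Each extra spin doubles the number of terms in the sum, so the same loop becomes impractical long before $N$ reaches the size of a physical system.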

Can we still answer questions about the marginal distributions by resorting to a variational principle?

Variational inference in machine learning

If we could calculate the sum over all configurations of random variables, we could calculate the partition function. But we can’t, because the sum grows as $2^N$.

With our physics hat on, our strategy was to approximate the partition function.

From a machine learning perspective, this technique is known as variational inference. We vary something simple to infer something complicated.

Let’s look at how the variational free energy is derived in machine learning and used to approximate partition functions.

We have a probability model of random variables $p_\theta(s_1, ..., s_N)$ and we seek to calculate its normalizing constant or partition function 7.

Let’s construct a simpler probability distribution $q_\lambda(s_1, ..., s_N)$, parameterized by $\lambda$, and use it to approximate our model.

How good is our approximation? One way of measuring how close our approximation is to our goal distribution is with the Kullback-Leibler divergence.

This divergence between $q$ and $p$, or relative entropy, measures the amount of information (in bits or nats) that is lost when using $q$ to approximate $p$.

This gives us a criterion with which to vary our approximation. We vary the $\lambda$ parameter of our approximation until we minimize the approximation error, as measured by the Kullback-Leibler divergence.

The KL divergence is written with a double vertical bar as

\[\textrm{KL}(q(s) \mid\mid p(s)) = \int q(s) \log \frac{q(s)}{p(s)}ds\]

Let’s assume we are dealing with an exponential family distribution such as the Ising model. We let $p$ be the Boltzmann distribution for our model with the known energy function $E(s_1, ..., s_N)$:

\[p(s) = \frac{e^{-\beta E(s)}}{Z}\]

We assume that $q$ is a family of distributions with another energy function that has parameters $\lambda$:

\[q_\lambda(s) = \frac{e^{-\beta E_\lambda(s)}}{Z_q}\]

To measure how much information we lose when we use our approximation $q$ instead of $p$, we plug them into the Kullback-Leibler divergence:

\[\textrm{KL}(q_\lambda(s) \mid \mid p(s)) = \int q_\lambda(s) \log q_\lambda(s)ds - \int q_\lambda(s) \log \exp{(-\beta E(s))}ds + \log Z\] \[= \mathbb{E}_{q_\lambda} [\log q_\lambda(s)] - \mathbb{E}_{q_\lambda}[-\beta E(s)] + \log Z\] \[= -\mathcal{L}(\lambda) + \log Z\]

where we have defined the variational lower bound $\mathcal{L}(\lambda)$ as

\[\mathcal{L}(\lambda) = \mathbb{E}_{q_\lambda}[-\beta E(s)] - \mathbb{E}_{q_\lambda}[\log q_\lambda(s) ]\]

We can move the variational lower bound to the other side of the equation to get the following identity:

\[\log Z = \textrm{KL}(q \mid\mid p) + \mathcal{L}(\lambda)\]

With Jensen’s inequality it is easy to show that the KL divergence is always greater than or equal to zero. This means that if we make $\mathcal{L}(\lambda)$ bigger, the KL divergence must get smaller (i.e. our approximation must improve). Thus we can lower bound the partition function:

\[\log Z \geq \mathcal{L}(\lambda)\]

This means we can vary the parameters $\lambda$ of our approximation to improve the lower bound, and get a better and better approximation to the partition function!
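For a system small enough to enumerate, the identity $\log Z = \textrm{KL}(q \mid\mid p) + \mathcal{L}(\lambda)$ can be checked directly. The sketch below assumes a ring of six spins, a mean-field $q_\lambda$ with $\lambda = \Delta H$, and arbitrary parameter values; none of these choices come from the text.

```python
import itertools
import math

# Illustrative values (assumptions): inverse temperature, coupling, field, size.
BETA, J, H, N = 1.0, 1.0, 0.5, 6

def energy(spins):
    """Ising energy on a ring, counting each bond once."""
    bonds = sum(spins[i] * spins[(i + 1) % N] for i in range(N))
    return -J * bonds - H * sum(spins)

def q_prob(spins, lam):
    """Mean-field q: independent spins in an effective field H + lam."""
    field = BETA * (H + lam)
    return math.prod(math.exp(field * s) / (2 * math.cosh(field)) for s in spins)

states = list(itertools.product((-1, 1), repeat=N))
log_Z = math.log(sum(math.exp(-BETA * energy(s)) for s in states))

def elbo(lam):
    """Variational lower bound: E_q[-beta E] - E_q[log q]."""
    return sum(q_prob(s, lam) * (-BETA * energy(s) - math.log(q_prob(s, lam)))
               for s in states)

def kl(lam):
    """KL(q || p), which needs the exact log Z and is only feasible for tiny N."""
    return sum(q_prob(s, lam) * (math.log(q_prob(s, lam)) + BETA * energy(s) + log_Z)
               for s in states)

for lam in (0.0, 1.0, 2.0):
    print(lam, round(elbo(lam), 4), round(kl(lam), 4), round(log_Z, 4))
```

For every $\lambda$ the bound and the KL divergence should add up to the same $\log Z$, with the bound rising exactly as the KL divergence falls.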

Note that in the definition of the variational lower bound, we do not need to worry about the arduous task of calculating the partition function: it does not depend on $\lambda$.

This is awesome: we have constructed an approximation $q_\lambda$ to our probability model $p$ and found a way to vary its parameters so that our approximation gets better and better.

The interesting part is that we can improve the approximation to our model $p$ without calculating its intractable partition function. We only need to evaluate its energy function $E(s)$, which is cheap to compute.

Is this too clever to be true? Have we surrendered anything? We have lost the ability to measure how good our approximation is, in absolute terms—for that, we still need to calculate the partition function to compute the KL divergence. We do know that as long as our lower bound $\mathcal{L}(\lambda)$ increases as we vary $\lambda$, our approximation gets better, and this is sufficient for a variety of problems.

Variational inference as the Gibbs-Bogoliubov-Feynman inequality!

Let’s see if this is the same as the Gibbs-Bogoliubov-Feynman inequality we saw in physics. Recall that the inequality is

\[Z \geq Z_{MF} \exp{[-\beta \langle E - E_{MF}\rangle_{MF}]}.\]

Taking logarithms:

\[\log Z \geq - \langle \beta E\rangle_{MF} + \langle \beta E_{MF}\rangle_{MF} + \log Z_{MF}\] \[\Rightarrow \log Z \geq \mathbb{E}_{q_\lambda}[-\beta E(s)] - \mathbb{E}_{q_\lambda} [\log q_\lambda(s) ]\] \[\Rightarrow \log Z \geq \mathcal{L}(\lambda)\]

Here we have identified the variational family we are using as the mean-field Boltzmann distribution $q_\lambda(s) = \frac{\exp(-\beta E_{MF}(s))}{Z_{MF}}$, which factorizes into a product over the individual spins. Again, $\lambda$ denotes the variational parameters that we vary to maximize the lower bound 8.
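To spell out the middle step (a short derivation, assuming the mean-field form $q_\lambda(s) = \exp(-\beta E_{MF}(s))/Z_{MF}$ used throughout this section), take the logarithm of $q_\lambda$ and average it under $q_\lambda$:

\[\log q_\lambda(s) = -\beta E_{MF}(s) - \log Z_{MF}\] \[\Rightarrow -\mathbb{E}_{q_\lambda}[\log q_\lambda(s)] = \langle \beta E_{MF}\rangle_{MF} + \log Z_{MF}\]

Adding $\mathbb{E}_{q_\lambda}[-\beta E(s)] = -\langle \beta E\rangle_{MF}$ to both sides recovers the right-hand side of the first line, which is why the two bounds agree term by term.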

This shows that variational inference in machine learning—maximizing a lower bound on the partition function—is exactly the Gibbs-Bogoliubov-Feynman inequality in action.

The evidence lower bound in approximate posterior inference

In machine learning we care about patterns in data. This gives rise to the concept of latent variables, unobserved random variables that capture patterns in observed data.

For example, in linear regression we might posit a linear relationship between someone’s age and their income. The scalar regression coefficient captures a latent pattern that we seek to infer from many examples of (age, income) tuples.

We refer to a probability model as a model of latent variables $z$ and data $x$. The posterior distribution of latent variables given observed data is written $p(z \mid x)$.

What is a posterior? In our regression example of the relationship between age and income, we want the posterior distribution of the regression coefficient after observing data. Our choice of prior on the coefficient is a modeling decision and reflects our belief about the statistical relationship we hope to observe.

The posterior is given by Bayes’ rule:

\[p(z \mid x) = \frac{p(x \mid z) p(z)}{\int p(x, z) dz}\]

The denominator is the evidence, the marginal distribution of the data: $p(x) = \int p(x, z) dz$. This is the normalizer of the joint distribution of latent variables and data, or the partition function. This partition function is a sum over all configurations of random variables, and is intractable as we saw twice before.

Can we still do posterior inference despite the intractable partition function?

The refrain is familiar: we have an intractable sum in our partition function, but we can approximate it using the tools we developed earlier! Variational inference to the rescue. Let’s write out the variational lower bound on the partition function:

\[\log Z = \log p(x) \geq \mathcal{L}(\lambda) = \mathbb{E}_{q_\lambda}[\log p(x, z)] - \mathbb{E}_{q_\lambda}[\log q_\lambda (z)]\]

Again, by varying the parameters $\lambda$ we can learn a good approximate posterior distribution $q_\lambda(z)$ to approximate the posterior we care about but can’t calculate, $p(z \mid x)$.

If we are using the variational method to learn an approximate posterior, our partition function is the evidence $p(x)$, and the bound is on its logarithm $\log p(x)$. We thus refer to the variational lower bound $\mathcal{L}(\lambda)$ as the Evidence Lower Bound or ELBO and speak of maximizing the ELBO to learn a good approximate posterior distribution.
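Here is a toy sketch of the ELBO in a model small enough to solve exactly. The model is an assumption for illustration: a single binary latent variable $z$, a Gaussian likelihood with unit variance, and an approximate posterior $q_\lambda(z = 1) = \lambda$. Because the evidence is a two-term sum, we can compare the bound to the exact answer.

```python
import math

# Hypothetical toy model: prior p(z) and the mean of p(x | z) for each z.
PRIOR = {0: 0.6, 1: 0.4}
MEANS = {0: -1.0, 1: 2.0}

def log_joint(x, z):
    """log p(x, z) = log p(z) + log N(x; MEANS[z], 1)."""
    log_lik = -0.5 * math.log(2 * math.pi) - 0.5 * (x - MEANS[z]) ** 2
    return math.log(PRIOR[z]) + log_lik

def elbo(x, lam):
    """E_q[log p(x, z)] - E_q[log q(z)] for q(z = 1) = lam, with 0 < lam < 1."""
    q = {0: 1.0 - lam, 1: lam}
    return sum(q[z] * (log_joint(x, z) - math.log(q[z])) for z in (0, 1))

x = 1.2  # a single made-up observation
log_evidence = math.log(sum(math.exp(log_joint(x, z)) for z in (0, 1)))
posterior_1 = math.exp(log_joint(x, 1) - log_evidence)  # exact p(z = 1 | x)

for lam in (0.1, 0.5, posterior_1):
    print(round(lam, 3), round(elbo(x, lam), 4), round(log_evidence, 4))
```

The bound sits below $\log p(x)$ for every $\lambda$ and touches it when $q_\lambda$ equals the exact posterior, which is the sense in which maximizing the ELBO learns a good approximate posterior.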

This technique has been used in machine learning for the past two decades. It is becoming popular because intractable partition functions come with the need to analyze large datasets. Because the variational principle relies on optimizing a lower bound, the field has borrowed heavily from the optimization literature to scale Bayesian inference to massive data. It’s an exciting area, as new techniques from stochastic optimization may enable us to explore new physics and machine learning models.

Connections: are machine learning techniques useful in physics?

There are many techniques for approximating partition functions developed in the machine learning community that may find use in physics.

For example, black box variational inference and automatic differentiation variational inference are generic methods that may be useful in physics. They develop frameworks for constructing expressive approximate distributions and efficient optimization techniques.

Question for physicists familiar with variational methods: is stochastic optimization used in variational methods? Would this be useful?

Connections: could tools from physics be useful in machine learning?

Yes! The Gibbs-Bogoliubov-Feynman inequality was originally developed in physics and found its way to machine learning through Michael Jordan’s group at MIT in the 90s.

There seems to be a separate literature on constructing flexible families of distributions to approximate distributions. The replica trick, renormalization group theory, and others are just some topics that are beginning to make their way from statistical physics to machine learning.

Another example of tools from physics used in machine learning is operator variational inference. In this work, we developed a framework for constructing operators (such as the KL divergence) that measure how good an approximation is. The framework enables making explicit the tradeoffs between how good our approximation is and how much computation a variational method requires. The Langevin-Stein operator is equivalent to the Hamiltonian operator in physics (note) and was originally developed in a Physical Review Letters paper.

A fun question to ponder is “why KL divergence?” and the physics perspective is illuminating. It corresponds to the first-order Taylor expansion of the partition function and comes with assumptions about the non-equilibrium perturbed distribution. Does the second-order Taylor expansion correspond to another divergence and yield more accurate solutions?

I recently learned about replica theory. The replica trick is a technique for calculating the partition function of a system exactly, using an insane formula. It raises the question: what assumptions do we need to use this for probabilistic graphical models?

I’m excited to see more work in this area as physicists migrate to data science and machine learning.

How can we make transitions faster? How can we efficiently move techniques between machine learning and physics? Would code samples be helpful?

This post is an attempt at mapping the language from one community to another. Another idea is a long review paper that gives detailed examples of models solved within a statistical physics framework (with mean-field methods, replica theory, renormalization theory, etc) and solved with modern variational inference from a machine learning perspective (black box variational inference, stochastic optimization, etc). This would highlight how the fields complement each other.

Glossary

Expectations: the angle brackets $\langle ~~\cdot~~\rangle$ denote an expectation. In the machine learning literature, this is denoted as $\mathbb{E}_p[~~\cdot~~]$ for the expectation of a quantity with respect to the distribution $p$. For example, $\langle f(\vec{s}) \rangle$ denotes an expectation of a function of the spins $f(\vec{s})$. The expectation is implicitly with respect to the Boltzmann distribution: $\langle f(\vec{s}) \rangle = \mathbb{E}_p[f(\vec{s})] = \sum_{\{s_1, ..., s_N\}} f(\vec{s}) p(\vec{s}) = \sum_{\{s_1, ..., s_N\}} f(\vec{s})\frac{e^{-\beta E(\vec{s})}}{Z}$
Spins in physics are called random variables in statistics and machine learning.
The evidence lower bound in variational inference is the negative free energy in physics terminology.

Anything to add or fix in this article to reduce confusion and increase clarity? Please email me, tweet, or submit a pull request.

References

Peterson & Anderson (1987) used solutions to time-dependent Ising models to learn the parameters of Boltzmann machines. This is a canonical reference for the ‘start’ of variational inference as it is known in the machine learning community.
You can go deep into Ising models: there are hundreds of lectures and references online. Here are the sources I used for these notes: from Basel and Munich.
Dave’s course, Foundations of Graphical Models
Wainwright & Jordan (2008) is challenging but worthwhile.
David MacKay’s Information Theory, Inference, and Learning Algorithms has a section on variational free energy (Chapter 33, p. 422).
David Chandler’s Introduction to Modern Statistical Mechanics (1987) has a simple derivation of the variational free energy (Section 5.1, pp. 135-138) that I followed in this exposition.
Feynman, Statistical Mechanics - A set of lecture notes (1972) derives the variational free energy using a perturbation expansion (Section 2.11, pp. 67-71).
Parisi’s Statistical Field Theory (1988) derives the variational principle in three different ways (Section 3.2, pp. 24-31).
Matthew Beal’s thesis has interesting references, and Rich Turner has notes on correspondences between physics and machine learning.

Thanks to Bohdan Kulchytskyy, Florian Wentzel, Siddharth Mishra-Sharma, Smiti Kaul, Guillaume Verdon, Henri Palacci, Sam Ritter, Mattias Fitzpatrick, and Sophie Kleber for comments and encouragement. Image credits: Freepik for iconography, and Analytical Scientific for the Newton’s cradle image.

Addendum

This blog post ended up seeding the first several chapters of my thesis.

Footnotes

Derivation ↩
For a tiny system, e.g. with three spins, we have $8$ states and the sum is doable - but the system is uninteresting. ↩
For example, the magnetization of dysprosium aluminium garnet at low temperatures is exactly described by this model. ↩
To see this, recall that $\cosh x = \frac{e^x + e^{-x}}{2}$ ↩
Visual proof that $e^x \geq x + 1$. ↩
The semicolon notation means “the distribution over $x$ is parameterized in terms of the parameter $\pi$”. ↩
Writing the parameters of a distribution as a subscript ($p_\theta(s)$) is shorthand for writing them after the semicolon ($p(s; \theta)$). ↩
In the variational treatment of the Ising model we had one variational parameter, the perturbation to the static magnetic field $\lambda = \Delta H$. ↩

How does physics connect to machine learning? was originally published by Jaan Lı at Jaan Lı on August 11, 2017.

https://jaan.io/how-does-physics-connect-machine-learning

food2vec - Augmented cooking with machine intelligence

Jaan Lı Jan 22, 2017 Updated Jan 22, 2017

TL;DR: Check out the tools demo to explore food analogies and recommendations, or scroll down for an interactive map of a hundred thousand recipes from around the world.

I haven’t eaten in five days. I dream of food. I study food. Deep in ketosis, my body has adapted to consume itself: I am food. There is no better time to dig into modeling grub.

Machine intelligence has changed your life, from how you listen to music through Discover Weekly playlists and consume news through Facebook to how you talk to your hand computer’s friendly digital assistant. But why hasn’t it changed how we eat? Can we modify the ingredients of language processing algorithms to get insights about food? If you tell me what you want to eat, can I recommend complementary foods, much like Spotify recommends complementary songs?

Word embeddings are a useful technique for analyzing discrete data. Say we use $170,000$ words from the Oxford English dictionary. We can represent each word (such as “food”) as a vector as follows: a list of $169,999$ zeros, with a single $1$ at the location of the word in the vocabulary. In our case, “food” may be at location $29,163$ near other words beginning with the letter f. Then the vector for “food” would look like:

\[[0, 0, 0, ..., 0, 0, 1, 0, 0, ..., 0].\]

However, this is inadequate for comparing words. To compare documents and get useful insights from our data, we need to aggregate over $170,000$ dimensions for each word, which takes far too long. Can we do better?

Embeddings let us reduce the dimensionality of the problem, and give us a powerful representation of language. We can build a model of language where we assign a hundred random numbers to each word. To train the model, we use these hundred numbers of each word to predict their context. The “context” of a word consists of its surrounding words. This is the main idea: words that occur in similar contexts should have similar meanings. We tweak the numbers assigned to a word to make them better at predicting words in the context. Initially, the random numbers assigned to a word will be bad at predicting words in the context. But gradually, through this process of tweaking the model’s predictions of surrounding words, we get a hundred numbers that are far from random. The hundred numbers representing each word will capture part of its meaning: similar words will cluster together because they occur in each other’s contexts, and words with different meanings are pushed far apart (out-of-context). By representing each word as an embedding in $100$ dimensions, we have reduced the dimensionality more than a thousandfold from $170,000$ and gained a better representation of language.

For modeling food, we have a collection of recipes. We can define the context of an ingredient in a recipe to be the rest of the foods in the recipe. This demonstrates the flexibility of embeddings: by making a small change in the definition of the context, we can now apply it to a totally different kind of data.
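A minimal sketch of this definition of context: each ingredient is paired with the remaining ingredients of its recipe. The function name and the example recipe are hypothetical, and the actual preprocessing in the project repository may differ.

```python
def ingredient_context_pairs(recipe):
    """Pair each ingredient with the rest of the recipe as its context."""
    pairs = []
    for i, ingredient in enumerate(recipe):
        context = recipe[:i] + recipe[i + 1:]
        pairs.append((ingredient, context))
    return pairs

recipe = ["lamb", "cumin", "tomato"]  # hypothetical recipe
pairs = ingredient_context_pairs(recipe)
for ingredient, context in pairs:
    print(ingredient, "->", context)
```

Unlike a sliding window over a sentence, the context here ignores order and distance: every other ingredient in the recipe counts equally.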

Food similarity map

After training the embedding algorithm on a collection of $95,896$ recipes, we get $100$-dimensional embeddings for each food. Humans can’t visualize high dimensions, so we use an approximation technique to visualize similarity between the foods in two dimensions.

Here is a similarity map of the $2,087$ ingredients used in the recipes. Hover over a point to see which food it represents:

The map of foods is reasonable. Ingredients from Asia cluster together, as do ingredients used in European and North American cooking.

Recipe embedding map

We can generate an embedding for a recipe by taking the average of its ingredients’ embeddings. Here is a map of $95,896$ recipes from around the world. Hover over a point to see the recipe, and click on the cuisine legend on the right to show or hide certain regions:

IMPORTANT: you are about to download 15MB of data. Click here to access the map, zoom in, and discover new flavors. Is this the fastest way to browse 100k recipes by similarity?

Interesting patterns emerge. Asian recipes cluster together, as do Southern European recipes. Northern European and American foods are all over the place, maybe because of transmission of recipes due to migration, or over-representation in the data.

Food similarity tool

Access the tool at this link. We can calculate food similarity by looking at which food is closest in the high dimensional space in the embeddings.

These mostly make sense - foods are closest to other foods they appear with in recipes:

Cheese is closest to macaroni
Sesame oil is closest to egg noodle
Milk is closest to nutmeg
Olive oil is closest to parmesan cheese

Food analogy tool

Access the tool here. Food analogies, like word analogies, are calculated with vector arithmetic. For the analogy “Food A is to food B, as food C is to food D”, the goal is to predict a reasonable food D. We can do this by subtracting food A from food B, then adding food C. For example, calculating $(bacon - egg) + orangejuice$ in embedding space will yield an embedding. The closest embedding to this is $coffee$ in our model of food. The classic example from word embeddings is $(king - man) + woman = queen$. Is this intuitive? “Man is to king as woman is to queen” makes sense in natural language, but food analogies are less clear. With practice, we may be able to train our taste detectors and devise hypotheses to test in the realm of food. I also included cuisine embeddings by representing them as the average of their recipes’ embeddings.

Some of these are more plausible than others:

Egg is to bacon as orange juice is to coffee.
Bread is to butter as roast beef is to sage.
Smoked salmon is to dill as lamb is to asparagus.
South Asian is to rice as Southern European is to thyme.
Rice is to sesame seed as macaroni is to pimento.
Roasted beef is to green bell pepper as pork sausage is to fenugreek.
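The analogy arithmetic can be sketched in a few lines. The three-dimensional vectors below are made up so that the example works by construction; the real embeddings are $100$-dimensional and learned from recipes, and the function names are my own.

```python
import math

# Made-up toy embeddings (assumption), not the trained food2vec vectors.
EMBEDDINGS = {
    "egg": [1.0, 0.0, 0.0],
    "bacon": [1.0, 1.0, 0.0],
    "orange juice": [0.0, 0.0, 1.0],
    "coffee": [0.0, 1.0, 1.0],
    "rice": [-1.0, 0.0, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, embeddings):
    """Answer 'a is to b as c is to ?' with the food closest to b - a + c."""
    query = [vb - va + vc
             for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [food for food in embeddings if food not in (a, b, c)]
    return max(candidates, key=lambda food: cosine(query, embeddings[food]))

print(analogy("egg", "bacon", "orange juice", EMBEDDINGS))
```

Excluding the three input foods from the candidates is a common convention, since the query vector is often closest to one of its own inputs.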

Recipe recommendation tool

Access the tool here. We can use our model of food as a recommendation system for cooks. By taking the average embedding for a set of foods, we can look up foods with the closest embeddings.

For example, I am a lifelong aficionado of peanut butter jam sandwiches. I entered my usual favorite: white bread, butter, peanut butter, honey. The top recommendation was: strawberry. I’ve never tried that, and it’s pretty good! I happily broke my fast with it. For the recipe of lamb, cumin, tomato, the top recommendation is raisin - also reasonable and interesting. Other recommendations are a bit wackier, so best of luck.
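The recommendation step can be sketched the same way: average the embeddings of the foods you entered, then rank the remaining foods by cosine similarity. The two-dimensional vectors are invented for illustration, so the output below reflects the toy numbers rather than the trained model.

```python
import math

# Invented toy embeddings (assumption), not the trained food2vec vectors.
EMBEDDINGS = {
    "white bread": [1.0, 0.2],
    "peanut butter": [0.8, 0.6],
    "honey": [0.4, 0.9],
    "strawberry": [0.7, 0.6],
    "cumin": [-0.9, 0.1],
    "tomato": [-0.5, 0.5],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def recommend(foods, embeddings, k=1):
    """Return the k foods closest to the average embedding of `foods`."""
    dim = len(next(iter(embeddings.values())))
    query = [sum(embeddings[f][d] for f in foods) / len(foods) for d in range(dim)]
    candidates = [f for f in embeddings if f not in foods]
    ranked = sorted(candidates, key=lambda f: cosine(query, embeddings[f]),
                    reverse=True)
    return ranked[:k]

print(recommend(["white bread", "peanut butter", "honey"], EMBEDDINGS))
```

Averaging treats every ingredient equally; weighting by quantity would be a natural variation if the data included amounts.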

If you end up adding an ingredient to your food based on these tools, I’d love to hear how it tasted: ping me on Twitter or email!

What’s next?

Figuring out the right user interface to explore these models. The code for the plots and recommendation tools is on github. It would be great to make these mobile-friendly and test other ways of presenting recommendations from the model to users.
word2vec is not the best model for this. Multi-class regression should work well, and I added a working demo of this to the repo. This is a rare case where the vocabulary size (number of ingredients) is very small, so we can fit both models and compare them. This could reveal idiosyncrasies in the noise-contrastive estimation loss used in word2vec and provides an interesting testbed.
Scaling up the data: Do you have a larger dataset of recipes, or do you know how to scrape one? I’d love to check it out. This would also fix bias in the data as the majority of the recipes are currently North American.
Testing out recipe analogies combined with food analogies: this may be more intuitive for us humans. For example, “pancakes are to maple syrup, as an omelette is to cheese” could be easier to think about than analogies with individual ingredients.

Resources

This NYT piece, The Great AI Awakening, does a much better job at describing embeddings than I can
Wesley has a neat paper on a similar approach: diet2vec
Sanjeev Arora’s research has good explanations for the analogy properties of embeddings
The t-SNE algorithm for visualizing high-dimensional embeddings
The original Nature Scientific Report with the data
Dave taught a fantastic class that helped me understand embeddings
Maja’s paper on exponential family embeddings generalizes word2vec to other distributions that would be neat to try on this data (word2vec can be interpreted as a Bernoulli embedding model with biased gradients)

Thanks to David Blei for the idea, Peter Bearman for presenting his work to our group, MealMakeOverMoms for the mise photo, Anthony for open-sourcing the embedding browser on which ours is based, and Plotly for open-sourcing their fantastic plotting library.

Feel free to ping me on Twitter or email with feedback or ideas!

Discussion on Hacker News and Reddit. Also see slides from a talk at the New York Times on this project.

food2vec - Augmented cooking with machine intelligence was originally published by Jaan Lı at Jaan Lı on January 22, 2017.

https://jaan.io/food2vec-augmented-cooking-machine-intelligence

Variational Autoencoder Perspectives.md

Jaan Lı Jul 24, 2016 Updated Jul 24, 2016

### Takeaway: why the neural net perspective limits us

I hope you are convinced that reasoning about the variational autoencoder is less ambiguous and less confusing from the perspective of variational inference in probability models.

In neural net language, the variational autoencoder refers to an encoder, a decoder, and a loss function. In probability model terms, the variational autoencoder refers to approximate inference in a latent Gaussian model, where the approximate posterior and model likelihood are parametrized by neural nets (the inference and generative networks).

The sentence describing the variational autoencoder in neural net terms is unclear: What is the encoder? What does the decoder mean? What is the loss function? Each term requires further explanation. In contrast, the probability model language gives us an objective function (the ELBO) for free, and we can simply state that we parametrize the approximate posterior and model with neural nets.

Here are more reasons why we should favor the probability model perspective on variational autoencoders:

* *Separating model and inference*: Shakir [makes this point well](http://blog.shakirm.com/2015/03/a-statistical-view-of-deep-learning-ii-auto-encoders-and-free-energy/). Rather than being limited to an 'encoder' in neural net terms, we can think of the probability model at hand, $p(x, z)$, separately from the approximate inference scheme. This lets us choose from a variety of methods, rather than thinking only in terms of amortized inference using a neural net. It is our choice whether to explore other (perhaps better) methods such as mean-field variational inference or MCMC/HMC/Langevin dynamics to learn the parameters of the model.
* *Composability*: the moment we add a second layer of latent variables to our model that depend on the first layer, the encoder/decoder framework breaks down. How should we parametrize the inference network? Can we still do amortized inference? The framework of probability models can help us build more complex models from basic building blocks, and gives us clear frameworks for how to do inference. Thinking in terms of encoders is dangerous for top-down inference, as it is unclear how to parametrize the encoder for any more than one layer of latent variables.
* *Regularization is free*: in neural net terms, we discussed a 'regularizer' term in the loss function (the KL divergence between the approximate posterior and prior). This comes out of the blue if one is not familiar with variational inference. But in probability model language, it is simply an alternate form of the ELBO, and we can immediately think about alternative priors that may be more appropriate for the data we wish to model.

Variational Autoencoder Perspectives.md was originally published by Jaan Lı at Jaan Lı on July 24, 2016.

https://jaan.io/variational-autoencoder-perspectives.md

Tutorial - What is a variational autoencoder?

Jaan Lı Jul 18, 2016 Updated Jul 18, 2016


Why do deep learning researchers and probabilistic machine learning folks get confused when discussing variational autoencoders? What is a variational autoencoder? Why is there unreasonable confusion surrounding this term?

There is a conceptual and language gap. The sciences of neural networks and probability models do not have a shared language. My goal is to bridge this idea gap and allow for more collaboration and discussion between these fields, and provide a consistent implementation (Github link). If many words here are new to you, jump to the glossary.

Variational autoencoders are cool. They let us design complex generative models of data, and fit them to large datasets. They can generate images of fictional celebrity faces and high-resolution digital artwork.

Variational autoencoder applied to faces. — Fictional celebrity faces generated by a variational autoencoder (by Alec Radford).

These models also yield state-of-the-art machine learning results in image generation and reinforcement learning. Variational autoencoders (VAEs) were defined in 2013 by Kingma et al. and Rezende et al.

How can we create a language for discussing variational autoencoders? Let’s think about them first using neural networks, then using variational inference in probability models.

The neural net perspective

In neural net language, a variational autoencoder consists of an encoder, a decoder, and a loss function.

The encoder compresses data into a latent space (z). The decoder reconstructs the data given the hidden representation.

The encoder is a neural network. Its input is a datapoint $x$, its output is a hidden representation $z$, and it has weights and biases $\theta$. To be concrete, let’s say $x$ is a 28 by 28-pixel photo of a handwritten number. The encoder ‘encodes’ the data which is $784$-dimensional into a latent (hidden) representation space $z$, which is much less than $784$ dimensions. This is typically referred to as a ‘bottleneck’ because the encoder must learn an efficient compression of the data into this lower-dimensional space. Let’s denote the encoder $q_\theta (z \mid x)$. We note that the lower-dimensional space is stochastic: the encoder outputs parameters to $q_\theta (z \mid x)$, which is a Gaussian probability density. We can sample from this distribution to get noisy values of the representations $z$.

The decoder is another neural net. Its input is the representation $z$, it outputs the parameters to the probability distribution of the data, and has weights and biases $\phi$. The decoder is denoted by $p_\phi(x\mid z)$. Running with the handwritten digit example, let’s say the photos are black and white and represent each pixel as $0$ or $1$. The probability distribution of a single pixel can be then represented using a Bernoulli distribution. The decoder gets as input the latent representation of a digit $z$ and outputs $784$ Bernoulli parameters, one for each of the $784$ pixels in the image. The decoder ‘decodes’ the real-valued numbers in $z$ into $784$ real-valued numbers between $0$ and $1$. Information from the original $784$-dimensional vector cannot be perfectly transmitted, because the decoder only has access to a summary of the information (in the form of a less-than-$784$-dimensional vector $z$). How much information is lost? We measure this using the reconstruction log-likelihood $\log p_\phi (x\mid z)$ whose units are nats. This measure tells us how effectively the decoder has learned to reconstruct an input image $x$ given its latent representation $z$.
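To make the shapes concrete, here is a minimal NumPy sketch of the encoder and decoder described above. The single-layer architecture, the two-dimensional latent space, the random untrained weights, and the function names `encode` and `decode` are all assumptions made for illustration; this is not the network used for the figures in this post.

```python
import numpy as np

rng = np.random.default_rng(0)
data_dim, latent_dim = 784, 2  # 28 x 28 pixels; the latent size is an assumption

# Hypothetical, untrained weights: theta for the encoder, phi for the decoder.
theta = {"W": rng.normal(0, 0.01, (data_dim, 2 * latent_dim)),
         "b": np.zeros(2 * latent_dim)}
phi = {"W": rng.normal(0, 0.01, (latent_dim, data_dim)),
       "b": np.zeros(data_dim)}

def encode(x, theta):
    """Return the mean and standard deviation of the Gaussian q(z | x)."""
    out = x @ theta["W"] + theta["b"]
    mu, log_sigma = out[:latent_dim], out[latent_dim:]
    return mu, np.exp(log_sigma)  # exp keeps the standard deviation positive

def decode(z, phi):
    """Return 784 Bernoulli parameters, one per pixel, for p(x | z)."""
    logits = z @ phi["W"] + phi["b"]
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid keeps values in (0, 1)

x = rng.integers(0, 2, size=data_dim).astype(float)  # a fake binary image
mu, sigma = encode(x, theta)
z = mu + sigma * rng.standard_normal(latent_dim)     # noisy representation
pixel_probs = decode(z, phi)
```

The encoder returns distribution parameters rather than a single point, which is what makes the lower-dimensional representation stochastic.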

The loss function of the variational autoencoder is the negative log-likelihood with a regularizer. Because there are no global representations that are shared by all datapoints, we can decompose the loss function into only terms that depend on a single datapoint $l_i$. The total loss is then $\sum_{i=1}^N l_i$ for $N$ total datapoints. The loss function $l_i$ for datapoint $x_i$ is:

\[l_i(\theta, \phi) = - \mathbb{E}_{z\sim q_\theta(z\mid x_i)}[\log p_\phi(x_i\mid z)] + \mathbb{KL}(q_\theta(z\mid x_i) \mid\mid p(z))\]

The first term is the reconstruction loss, or expected negative log-likelihood of the $i$-th datapoint. The expectation is taken with respect to the encoder’s distribution over the representations. This term encourages the decoder to learn to reconstruct the data. If the decoder’s output does not reconstruct the data well, statistically we say that the decoder parameterizes a likelihood distribution that does not place much probability mass on the true data. For example, if our goal is to model black and white images and our model places high probability on there being black spots where there are actually white spots, this will yield the worst possible reconstruction. Poor reconstruction will incur a large cost in this loss function.

The second term is a regularizer that we throw in (we’ll see how it’s derived later). This is the Kullback-Leibler divergence between the encoder’s distribution $q_\theta(z\mid x)$ and $p(z)$. This divergence measures how much information is lost (in units of nats) when using $q$ to represent $p$. It is one measure of how close $q$ is to $p$.

In the variational autoencoder, $p$ is specified as a standard Normal distribution with mean zero and variance one, or $p(z) = Normal(0,1)$. If the encoder outputs representations $z$ that are different than those from a standard normal distribution, it will receive a penalty in the loss. This regularizer term means ‘keep the representations $z$ of each digit sufficiently diverse’. If we didn’t include the regularizer, the encoder could learn to cheat and give each datapoint a representation in a different region of Euclidean space. This is bad, because then two images of the same number (say a 2 written by different people, $2_{alice}$ and $2_{bob}$) could end up with very different representations $z_{alice}, z_{bob}$. We want the representation space of $z$ to be meaningful, so we penalize this behavior. This has the effect of keeping similar numbers’ representations close together (e.g. so the representations of the digit two $\{z_{alice}, z_{bob}, z_{ali}\}$ remain sufficiently close).
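As a rough sketch, the two terms of $l_i$ could be computed as follows for a Gaussian encoder and Bernoulli decoder. The function name `vae_loss` and the toy inputs are hypothetical. The expectation in the first term is approximated with a single sample of $z$ (an assumption of this sketch), and the second term uses the standard closed form for the divergence between a diagonal Gaussian and a standard Normal.

```python
import numpy as np

def vae_loss(x, mu, sigma, pixel_probs):
    """Loss l_i for one datapoint.

    x           : binary image, shape (D,)
    mu, sigma   : parameters of q(z | x_i) output by the encoder
    pixel_probs : Bernoulli parameters output by the decoder for one sample z
    """
    eps = 1e-7  # avoids log(0)
    # Reconstruction term: Bernoulli log-likelihood of the image, in nats.
    log_lik = np.sum(x * np.log(pixel_probs + eps)
                     + (1 - x) * np.log(1 - pixel_probs + eps))
    # Regularizer: KL(Normal(mu, sigma^2) || Normal(0, 1)), summed over dimensions.
    kl = 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma))
    return -log_lik + kl, kl

# Toy check: when q equals the prior the regularizer vanishes.
x = np.array([0.0, 1.0, 1.0, 0.0])
loss, kl = vae_loss(x, mu=np.zeros(2), sigma=np.ones(2),
                    pixel_probs=np.full(4, 0.5))
```

In this toy call the penalty is zero because the encoder's distribution matches the prior exactly, so the loss is just the reconstruction term for four pixels predicted at probability one half.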

We train the variational autoencoder using gradient descent to optimize the loss with respect to the parameters of the encoder and decoder $\theta$ and $\phi$. For stochastic gradient descent with step size $\rho$, the encoder parameters are updated using $\theta \leftarrow \theta - \rho \frac{\partial l}{\partial \theta}$ and the decoder is updated similarly.

The probability model perspective

Now let’s think about variational autoencoders from a probability model perspective. Please forget everything you know about deep learning and neural networks for now. Thinking about the following concepts in isolation from neural networks will clarify things. At the very end, we’ll bring back neural nets.

In the probability model framework, a variational autoencoder contains a specific probability model of data $x$ and latent variables $z$. We can write the joint probability of the model as $p(x, z) = p(x \mid z) p(z)$. The generative process can be written as follows.

For each datapoint $i$:

Draw latent variables $z_i \sim p(z)$
Draw datapoint $x_i \sim p(x\mid z)$

We can represent this as a graphical model:

The graphical model representation of the model in the variational autoencoder. The latent variable z is a standard normal, and the data are drawn from p(x|z). The shaded node for X denotes observed data. For black and white images of handwritten digits, this data likelihood is Bernoulli distributed.

This is the central object we think about when discussing variational autoencoders from a probability model perspective. The latent variables are drawn from a prior $p(z)$. The data $x$ have a likelihood $p(x \mid z)$ that is conditioned on latent variables $z$. The model defines a joint probability distribution over data and latent variables: $p(x, z)$. We can decompose this into the likelihood and prior: $p(x,z) = p(x\mid z)p(z)$. For black and white digits, the likelihood is Bernoulli distributed.
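The generative process above can be sketched by sampling from the prior and then from the likelihood. The fixed random matrix `W` is a hypothetical stand-in for whatever maps $z$ to the Bernoulli parameters, and the two-dimensional latent space is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, data_dim, n_datapoints = 2, 784, 5

# Hypothetical fixed likelihood parameters standing in for a trained model.
W = rng.normal(0, 1.0, (latent_dim, data_dim))

def sample_datapoint(rng):
    z = rng.standard_normal(latent_dim)      # z_i ~ p(z) = Normal(0, 1)
    probs = 1.0 / (1.0 + np.exp(-(z @ W)))   # parameters of p(x | z_i)
    x = rng.binomial(1, probs)               # x_i ~ Bernoulli(probs), per pixel
    return z, x

samples = [sample_datapoint(rng) for _ in range(n_datapoints)]
```

Each datapoint gets its own draw of $z$, which is what makes the latent variables local.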

Now we can think about inference in this model. The goal is to infer good values of the latent variables given observed data, or to calculate the posterior $p(z \mid x)$. Bayes says:

\[p(z \mid x) = \frac{p(x \mid z)p(z)}{p(x)}.\]

Examine the denominator $p(x)$. This is called the evidence, and we can calculate it by marginalizing out the latent variables: $p(x) = \int p(x \mid z) p(z) dz$. Unfortunately, this integral requires exponential time to compute as it needs to be evaluated over all configurations of latent variables. We therefore need to approximate this posterior distribution.

Variational inference approximates the posterior with a family of distributions $q_\lambda(z \mid x)$. The variational parameter $\lambda$ indexes the family of distributions. For example, if $q$ were Gaussian, it would be the mean and variance of the latent variables for each datapoint $\lambda_{x_i} = (\mu_{x_i}, \sigma^2_{x_i})$.

How can we know how well our variational posterior $q(z \mid x)$ approximates the true posterior $p(z \mid x)$? We can use the Kullback-Leibler divergence, which measures the information lost when using $q$ to approximate $p$ (in units of nats):

\[\mathbb{KL}(q_\lambda(z \mid x) \mid \mid p(z \mid x)) = \mathbf{E}_q[\log q_\lambda(z \mid x)]- \mathbf{E}_q[\log p(x, z)] + \log p(x)\]

Our goal is to find the variational parameters $\lambda$ that minimize this divergence. The optimal approximate posterior is thus

\[q_\lambda^* (z \mid x) = {\arg\min}_\lambda \mathbb{KL}(q_\lambda(z \mid x) \mid \mid p(z \mid x)).\]

Why is this impossible to compute directly? The pesky evidence $p(x)$ appears in the divergence. This is intractable as discussed above. We need one more ingredient for tractable variational inference. Consider the following function:

\[ELBO(\lambda) = \mathbf{E}_q[\log p(x, z)] - \mathbf{E}_q[\log q_\lambda(z \mid x)].\]

Notice that we can combine this with the Kullback-Leibler divergence and rewrite the evidence as

\[\log p(x) = ELBO(\lambda) + \mathbb{KL}(q_\lambda(z \mid x) \mid \mid p(z \mid x))\]

By Jensen’s inequality, the Kullback-Leibler divergence is always greater than or equal to zero, so the ELBO is a lower bound on the log evidence $\log p(x)$. Because $\log p(x)$ does not depend on $\lambda$, minimizing the Kullback-Leibler divergence is equivalent to maximizing the ELBO. The abbreviation is revealed: the Evidence Lower BOund allows us to do approximate posterior inference. We are saved from having to compute and minimize the Kullback-Leibler divergence between the approximate and exact posteriors. Instead, we can maximize the ELBO which is equivalent (but computationally tractable).
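The identity $\log p(x) = ELBO(\lambda) + \mathbb{KL}$ can be checked numerically in a toy model where everything is available in closed form. The model here is an assumption chosen for convenience, not the variational autoencoder: $z \sim Normal(0, 1)$ and $x \mid z \sim Normal(z, 1)$, so the evidence is $Normal(0, 2)$ and the exact posterior is $Normal(x/2, 1/2)$. The function names are hypothetical.

```python
import numpy as np

def log_evidence(x):
    # z ~ Normal(0, 1), x | z ~ Normal(z, 1)  =>  x ~ Normal(0, 2)
    return -0.5 * np.log(2 * np.pi * 2.0) - x**2 / 4.0

def elbo(x, m, s):
    # q(z) = Normal(m, s^2); all expectations are available in closed form.
    expected_log_lik = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - m)**2 + s**2)
    expected_log_prior = -0.5 * np.log(2 * np.pi) - 0.5 * (m**2 + s**2)
    entropy = 0.5 * np.log(2 * np.pi * np.e * s**2)  # equals -E_q[log q]
    return expected_log_lik + expected_log_prior + entropy

def kl_to_posterior(x, m, s):
    # KL between q = Normal(m, s^2) and the exact posterior Normal(x / 2, 1 / 2).
    post_mean, post_var = x / 2.0, 0.5
    return (0.5 * np.log(post_var / s**2)
            + (s**2 + (m - post_mean)**2) / (2 * post_var) - 0.5)

x, m, s = 1.3, 0.2, 0.9
gap = log_evidence(x) - (elbo(x, m, s) + kl_to_posterior(x, m, s))
```

For any choice of `m` and `s` the gap should be zero up to floating point error, and the ELBO touches the log evidence only when $q$ equals the exact posterior.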

In the variational autoencoder model, there are only local latent variables (no datapoint shares its latent $z$ with the latent variable of another datapoint). So we can decompose the ELBO into a sum where each term depends on a single datapoint. This allows us to use stochastic gradient descent with respect to the parameters $\lambda$ (important: the variational parameters are shared across datapoints - more on this here). The ELBO for a single datapoint in the variational autoencoder is:

\[ELBO_i(\lambda) = \mathbb{E}_{q_\lambda(z\mid x_i)}[\log p(x_i\mid z)] - \mathbb{KL}(q_\lambda(z\mid x_i) \mid\mid p(z)).\]

To see that this is equivalent to our previous definition of the ELBO, expand the log joint into the prior and likelihood terms and use the product rule for the logarithm.
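Spelled out for a single datapoint, using only the definitions above, the expansion reads:

\[\begin{aligned}
ELBO_i(\lambda) &= \mathbb{E}_{q_\lambda(z\mid x_i)}[\log p(x_i, z)] - \mathbb{E}_{q_\lambda(z\mid x_i)}[\log q_\lambda(z\mid x_i)] \\
&= \mathbb{E}_{q_\lambda(z\mid x_i)}[\log p(x_i\mid z)] + \mathbb{E}_{q_\lambda(z\mid x_i)}[\log p(z)] - \mathbb{E}_{q_\lambda(z\mid x_i)}[\log q_\lambda(z\mid x_i)] \\
&= \mathbb{E}_{q_\lambda(z\mid x_i)}[\log p(x_i\mid z)] - \mathbb{KL}(q_\lambda(z\mid x_i) \mid\mid p(z)).
\end{aligned}\]

The last step uses the definition of the Kullback-Leibler divergence, $\mathbb{KL}(q \mid\mid p) = \mathbb{E}_q[\log q] - \mathbb{E}_q[\log p]$.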

Let’s make the connection to neural net language. The final step is to parametrize the approximate posterior $q_\theta (z \mid x, \lambda)$ with an inference network (or encoder) that takes as input data $x$ and outputs parameters $\lambda$. We parametrize the likelihood $p(x \mid z)$ with a generative network (or decoder) that takes latent variables and outputs parameters to the data distribution $p_\phi(x \mid z)$. The inference and generative networks have parameters $\theta$ and $\phi$ respectively. The parameters are typically the weights and biases of the neural nets. We optimize these to maximize the ELBO using stochastic gradient descent (there are no global latent variables, so it is kosher to minibatch our data). We can write the ELBO and include the inference and generative network parameters as:

\[ELBO_i(\theta, \phi) = \mathbb{E}_{q_\theta(z\mid x_i)}[\log p_\phi(x_i\mid z)] - \mathbb{KL}(q_\theta(z\mid x_i) \mid\mid p(z)).\]

This evidence lower bound is the negative of the loss function for variational autoencoders we discussed from the neural net perspective; $ELBO_i(\theta, \phi) = -l_i(\theta, \phi)$. However, we arrived at it from principled reasoning about probability models and approximate posterior inference. We can still interpret the Kullback-Leibler divergence term as a regularizer, and the expected likelihood term as a reconstruction ‘loss’. But the probability model approach makes clear why these terms exist: to minimize the Kullback-Leibler divergence between the approximate posterior $q_\lambda(z \mid x)$ and model posterior $p(z \mid x)$.

What about the model parameters? We glossed over this, but it is an important point. The term ‘variational inference’ usually refers to maximizing the ELBO with respect to the variational parameters $\lambda$. We can also maximize the ELBO with respect to the model parameters $\phi$ (e.g. the weights and biases of the generative neural network parameterizing the likelihood). This technique is called variational EM (expectation maximization), because we are maximizing the expected log-likelihood of the data with respect to the model parameters.

That’s it! We have followed the recipe for variational inference. We’ve defined:

a probability model $p$ of latent variables and data
a variational family $q$ for the latent variables to approximate our posterior

Then we used the variational inference algorithm to learn the variational parameters (gradient ascent on the ELBO to learn $\lambda$). We used variational EM for the model parameters (gradient ascent on the ELBO to learn $\phi$).

Experiments

Now we are ready to look at samples from the model. We have two choices to measure progress: sampling from the prior or the posterior. To give us a better idea of how to interpret the learned latent space, we can visualize what the posterior distribution of the latent variables $q_\lambda(z \mid x)$ looks like.

Computationally, this means feeding an input image $x$ through the inference network to get the parameters of the Normal distribution, then taking a sample of the latent variable $z$. We can plot this during training to see how the inference network learns to better approximate the posterior distribution, and place the latent variables for the different classes of digits in different parts of the latent space. Note that at the start of training, the distribution of latent variables is close to the prior (a round blob around $0$).

Visualizing the learned approximate posterior during training. As training progresses the digit classes become differentiated in the two-dimensional latent space.

We can also visualize the prior predictive distribution. We fix the values of the latent variables to be equally spaced between $-3$ and $3$. Then we can take samples from the likelihood parametrized by the generative network. These ‘hallucinated’ images show us what the model associates with each part of the latent space.

Visualizing the prior predictive distribution by looking at samples of the likelihood. The x and y-axes represent equally spaced latent variable values between -3 and 3 (in two dimensions).
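A grid like the one in this figure could be produced roughly as follows. This is a sketch under assumptions: the `decode` function below is a hypothetical untrained stand-in for the generative network, and the grid size of 10 is arbitrary. Only the equally spaced latent values between $-3$ and $3$ come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
grid_size, data_dim = 10, 784

# Hypothetical untrained generative network: latent (2,) -> 784 Bernoulli parameters.
W = rng.normal(0, 1.0, (2, data_dim))

def decode(z):
    return 1.0 / (1.0 + np.exp(-(z @ W)))

# Equally spaced latent values between -3 and 3 along each axis.
axis = np.linspace(-3, 3, grid_size)
z_grid = np.array([[z1, z2] for z2 in axis for z1 in axis])  # shape (100, 2)

probs = decode(z_grid)            # shape (100, 784)
images = rng.binomial(1, probs)   # samples from the likelihood
canvas = images.reshape(grid_size, grid_size, 28, 28)  # one 28 x 28 image per grid cell
```

With a trained generative network in place of the random `W`, each cell of `canvas` would show what the model associates with that point in the latent space.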

Glossary

We need to decide on the language used for discussing variational autoencoders in a clear and concise way. Here is a glossary of terms I’ve found confusing:

Variational Autoencoder (VAE): in neural net language, a VAE consists of an encoder, a decoder, and a loss function. In probability model terms, the variational autoencoder refers to approximate inference in a latent Gaussian model where the approximate posterior and model likelihood are parametrized by neural nets (the inference and generative networks).
Loss function: in neural net language, we think of loss functions. Training means minimizing these loss functions. But in variational inference, we maximize the ELBO (which is not a loss function). This leads to awkwardness like calling optimizer.minimize(-elbo) as optimizers in neural net frameworks only support minimization.
Encoder: in the neural net world, the encoder is a neural network that outputs a representation $z$ of data $x$. In probability model terms, the inference network parametrizes the approximate posterior of the latent variables $z$. The inference network outputs parameters to the distribution $q(z \mid x)$.
Decoder: in deep learning, the decoder is a neural net that learns to reconstruct the data $x$ given a representation $z$. In terms of probability models, the likelihood of the data $x$ given latent variables $z$ is parametrized by a generative network. The generative network outputs parameters to the likelihood distribution $p(x \mid z)$.
Local latent variables: these are the $z_i$ for each datapoint $x_i$. There are no global latent variables. Because there are only local latent variables, we can easily decompose the ELBO into terms $ELBO_i$ that depend only on a single datapoint $x_i$. This enables stochastic gradient descent.
Inference: in neural nets, inference usually means prediction of latent representations given new, never-before-seen datapoints. In probability models, inference refers to inferring the values of latent variables given observed data.

One jargon-laden concept deserves its own subsection:

Mean-field versus amortized inference

This issue was very confusing for me, and I can see how it might be even more confusing for someone coming from a deep learning background. In deep learning, we think of inputs and outputs, encoders and decoders, and loss functions. This can lead to fuzzy, imprecise concepts when learning about probabilistic modeling.

Let’s discuss how mean-field inference differs from amortized inference. This is a choice we face when doing approximate inference to estimate a posterior distribution of latent variables. We might have various constraints: do we have lots of data? Do we have big computers or GPUs? Do we have local, per-datapoint latent variables, or global latent variables shared across all datapoints?

Mean-field variational inference refers to a choice of a variational distribution that factorizes across the $N$ data points, with no shared parameters:

\[q(z) = \prod_i^{N} q(z_i; \lambda_i)\]

This means there are free parameters for each datapoint $\lambda_i$ (e.g. $\lambda_i = (\mu_i, \sigma_i)$ for Gaussian latent variables). How do we do ‘learning’ for a new, unseen datapoint? We need to maximize the ELBO for each new datapoint, with respect to its mean-field parameter(s) $\lambda_i$.

Amortized inference refers to ‘amortizing’ the cost of inference across datapoints. One way to do this is by sharing (amortizing) the variational parameters $\lambda$ across datapoints. For example, in the variational autoencoder, the shared parameters are the weights and biases $\theta$ of the inference network. These global parameters are shared across all datapoints. If we see a new datapoint and want to see what its approximate posterior $q(z_i)$ looks like, we can run variational inference again (maximizing the ELBO until convergence), or trust that the shared parameters are ‘good-enough’. This can be an advantage over mean-field.

Which one is more flexible? Mean-field inference is strictly more expressive, because it has no shared parameters. The per-data parameters $\lambda_i$ can ensure our approximate posterior is most faithful to the data. Another way to think of this is that we are limiting the capacity or representational power of our variational family by tying parameters across datapoints (e.g. with a neural network that shares weights and biases across data).
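The difference shows up directly in how the variational parameters are stored. The sketch below contrasts the two choices on fake data; the array names, the single-layer inference network, and the sizes are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_datapoints, data_dim, latent_dim = 1000, 784, 2
X = rng.integers(0, 2, size=(n_datapoints, data_dim)).astype(float)  # fake data

# Mean-field: free parameters lambda_i = (mu_i, sigma_i) for every datapoint.
mean_field_mu = np.zeros((n_datapoints, latent_dim))
mean_field_sigma = np.ones((n_datapoints, latent_dim))
n_mean_field = mean_field_mu.size + mean_field_sigma.size  # grows with N

# Amortized: one set of shared weights theta maps any x to (mu, sigma).
theta_W = rng.normal(0, 0.01, (data_dim, 2 * latent_dim))
theta_b = np.zeros(2 * latent_dim)
n_amortized = theta_W.size + theta_b.size  # independent of N

def amortized_q(x):
    out = x @ theta_W + theta_b
    return out[..., :latent_dim], np.exp(out[..., latent_dim:])

# A new datapoint needs no new parameters under amortized inference.
x_new = rng.integers(0, 2, size=data_dim).astype(float)
mu_new, sigma_new = amortized_q(x_new)
```

Under mean-field, `x_new` would instead require adding a fresh row of parameters and optimizing it against the ELBO.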

Sample PyTorch/TensorFlow implementation

Here is the implementation that was used to generate the figures in this post: Github link

Footnote: the reparametrization trick

The final thing we need to implement the variational autoencoder is how to take derivatives with respect to the parameters of a stochastic variable. If we are given $z$ that is drawn from a distribution $q_\theta (z \mid x)$, and we want to take derivatives of a function of $z$ with respect to $\theta$, how do we do that? The $z$ sample is fixed, but intuitively its derivative should be nonzero.

For some distributions, it is possible to reparametrize samples in a clever way, such that the stochasticity is independent of the parameters. We want our samples to deterministically depend on the parameters of the distribution. For example, in a normally-distributed variable with mean $\mu$ and standard deviation $\sigma$, we can sample from it like this:

\[z = \mu + \sigma \odot \epsilon,\]

where $\epsilon \sim Normal(0, 1)$. Going from $\sim$ denoting a draw from the distribution to the equals sign $=$ is the crucial step. We have defined a function that depends on the parameters deterministically. We can thus take derivatives of functions involving $z$, $f(z)$ with respect to the parameters of its distribution $\mu$ and $\sigma$.
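Here is a small numerical sketch of the trick, assuming the test function $f(z) = z^2$ (chosen because its expectation, $\mu^2 + \sigma^2$, has known gradients $2\mu$ and $2\sigma$). The values of `mu`, `sigma`, and the sample count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n_samples = 1.5, 0.5, 200_000

eps = rng.standard_normal(n_samples)  # randomness that does not depend on mu, sigma
z = mu + sigma * eps                  # deterministic function of the parameters

# f(z) = z^2, so df/dz = 2z. Chain rule: dz/dmu = 1 and dz/dsigma = eps.
grad_mu = np.mean(2 * z * 1.0)
grad_sigma = np.mean(2 * z * eps)

# Analytic values for comparison: E[z^2] = mu^2 + sigma^2.
exact_grad_mu, exact_grad_sigma = 2 * mu, 2 * sigma
```

The Monte Carlo estimates should land close to the analytic gradients, which is the sense in which the derivative of a function of a sample is nonzero.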

The reparametrization trick allows us to push the randomness of a normally-distributed random variable z into epsilon, which is sampled from a standard normal. Diamonds indicate deterministic dependencies, circles indicate random variables.

In the variational autoencoder, the mean and variance are output by an inference network with parameters $\theta$ that we optimize. The reparametrization trick lets us backpropagate (take derivatives using the chain rule) with respect to $\theta$ through the objective (the ELBO) which is a function of samples of the latent variables $z$.