GeistHaus

Jaan Lı

Part of jaan.io

Machine learning for health & science at One Fact & Tartu.

One Fact Foundation

One fact is all it takes to create change. I started this nonprofit foundation and assembled a team of co-founders with superpowers to scale our AI for health and education globally. I gave a keynote at an NIH conference about our first product.

One Fact Foundation was originally published by Jaan Lı at Jaan Lı on February 19, 2023.

https://jaan.io/one-fact-foundation
Virtual Thesis Defense - Giving and Recording a Stressful Zoom Presentation

I defended my PhD on Zoom (YouTube). It was stressful, but I hope these notes save you time for presenting and enable you to record higher-quality academic presentations. Please let me know if you have any suggestions I missed!

General Presentation

Many tips from meatspace apply. I recommend Matt Might’s tri-tips:

For example, using Keynote and a bluetooth clicker makes a difference. In addition, I suggest standing to give the presentation and wearing shoes, and anything else that would help make it feel more like a standard academic presentation instead of mid-pandemic grilling hour.

The ad-hoc setup I used for my thesis defense.
Camera, Lighting, Audio, Makeup

I am not an expert and guessed at all of this the day of the defense. It seems like an easier solution is to use a smartphone for video with the Filmic Pro app, as suggested in Conan O’Brien’s filming-from-home setup.

Camera: I borrowed a friend’s Fujifilm X-T1. On my Macbook, it would only be usable if the Camlink was plugged into the right-hand-side USB 3.0 port.

Camera placement: I put the camera just above the external display the Keynote slide show was playing on. In hindsight, a laptop would have been sufficient, along with an external display to force Keynote to put the presenter view on the laptop. In the video, my eyes visibly look to the left or right on the large external display, and a laptop would have acted more like a teleprompter to keep my eyes close to the camera.

Where to stand: by looking at the video feed, I marked the floor with tape at the line where I would go out of the frame if I got too close to the laptop.

Makeup: I am not an expert, but I was lucky to be powdered to hide some of the stress-induced hyperhidrosis in my oily T-zone.

Lighting: I used a seasonal affective disorder light therapy lamp (ironically called the Happy Light) as a makeshift high-CRI ring light.

Audio: I really wanted to wrangle a microphone cable on video, and thought it would be funny to use a hand-held microphone during the presentation (not to mention higher-quality audio). This personal just-for-me Easter egg was just silly enough to make my day. I used a Shure SM58 plugged into an external audio device, and recorded it with Quicktime.

Recording and Editing

Setup for recording: In Zoom, Conan O’Brien’s setup covers this. Record to separate audio channels and record in high definition. As backup audio recordings I used my laptop microphone and smartphone. I used Elgato Game Capture to record the video, but in hindsight would have used Adobe Premiere to make editing easier. Keynote has a record feature for recording the slide show for higher resolution.

Recording: The day-of recording checklist was specific and non-deterministic: turn on smartphone audio recording, turn on Quicktime audio recording, turn on Game Capture HD recording for the video feed, then turn on Zoom recording, then screen share on Zoom (screen sharing to an external display makes the menu bar inaccessible), and only then turn on Keynote recording. If these steps were done out of order, something (typically Zoom) would break. After making this checklist, I had 5 minutes left before my defense with which to rehearse. Please don’t put yourself in this position like I did.

Editing: it takes a while to edit both the audio and video. I used Ableton to de-ess, compress, and add reverb to the audio, and stitch together the Zoom recording with the recording from my microphone. In Adobe Premiere, I synchronized the audio to the video, applied minor color correction, and exported it in high definition. Tools such as ffmpeg, homebrew, and imagemagick were very useful in converting between audio and video formats, or fixing timing issues, or making cute GIFs. In hindsight, it is worth paying someone to do this as it is a lot of work, and YouTube is rife with ‘I paid someone $5 to edit my video!!!!’ reviews of Fiverr as an economical option.

Slides

Most likely, people will use their personal laptops to view your defense, so the thumbnail of the main speaker will be tiny and the slide show itself will be huge. Small details on slides will be noticeable. Use Keynote to make the presentation (see mine here). Use Illustrator or Inkscape to make graphics and icons—I find free icons from a search engine, then edit the vectorized graphics and export to PDF; my Illustrator file is here. For math in Keynote, I recommend LaTeXiT that comes with MacTex/MikTeX, and putting white squares over new lines of math or terms in equations with the ‘dissolve in’ animation. Adding dissolving rectangles helped me go slower and more pedagogically with already-familiar material.

Practice

I recommend giving the talk several times beforehand, to a live audience (physically-distanced if at all possible). I feel very lucky that I was able to give parts of my presentation in stressful high-stakes environments ahead of time, as preparation. As an example of how jarring it is to give a virtual presentation: on Zoom, it was impossible to know who was laughing because everyone was muted and had their video off. This meant when I got to a joke in my presentation, I felt extremely uncomfortable because I could not see reactions, so did my best to deliver it deadpan. However, because I had given parts of the presentation in front of a live audience, I was more confident that some parts might still be funny, and this helped me keep going rather than freeze up.

If you are practicing on Zoom, you might ask a few audience members to leave their video feed on during your presentation. I did not think of this ahead of time, and had to consciously remind myself that I was talking to humans rather than a camera. During the question and answer period, it felt a lot nicer to be talking to human video feeds than to a Fujifilm X-T1 staring me down.

After giving the talk several times, do a full run-through: dress up, get the lighting right, set up all the recordings, answer questions from the audience after a full presentation, and practice editing in Premiere or via Fiverr. This seems like a lot of prep, but without it the cognitive load of handling everything for the first time lands on top of an already stressful defense. The run-through is worth it, to learn which parts of the setup to trust or mistrust and to figure out what to do in worst-case scenarios.

Day of the Defense

Assume Murphy’s Law: everything that can go wrong, will go wrong. This held true in my case. Keep several checklists on hand with what you need to do, keyboard shortcuts you might need (such as muting yourself or the audience), water and snacks, etc.

For example, Zoom became non-deterministic: one example of a bug I discovered 30 minutes before the presentation was that if screen sharing was on, the Mac menu bar became inaccessible, and I had to learn and write down all the keyboard shortcuts I needed for all the apps during the presentation. (One would need to mouse to the top of the primary display, only to have the menu bar appear in the second display.)

Another Zoom bug: broken screen-sharing after making Keynote full-screen, namely if an external display is plugged in and is being used for Keynote presenter mode, Zoom will stream the presenter mode instead of the actual presentation.

Silver Linings

I recommend bcc’ing many people to invite them to your defense, because this is now possible and low-stakes. I was scared to do this, but went through my short message service history, WhatsApp, Signal, Facebook, email, and (gratitude) journals to remember who to invite, and was surprised at the response.

Even people I had not interacted with in years were supportive, happy to have some good news during quarantine, and ready to join a surreal Zoom call on short notice. It made me feel more supported during an otherwise stressful time.

Another positive reappraisal I found helpful was that all this work and nitpicking about camera, audio, presentation quality was less for me, more for all the people who got me to this point (for example, my grandmother in Estonia who would not be able to see the presentation live, but I knew she would love to see the video).

Similarly, writing these notes helps being done with grad school feel more concrete psychologically. After all, I was in an empty room, talking to a camera, and seemingly got my PhD—maybe the simulation has improved after all.

Please send me a link to your talk or defense! I would love to see it. In addition, feel free to email me with any tips or corrections to these notes.

Thanks to Will Whitney for camera-lending and everyone else in my thesis acknowledgments and otherwise who got me to this point.

Virtual Thesis Defense - Giving and Recording a Stressful Zoom Presentation was originally published by Jaan Lı at Jaan Lı on May 21, 2020.

https://jaan.io/virtual-thesis-defense-recording-zoom-presentation
How does physics connect to machine learning?

Mandarin translation available - 用普通话阅读这篇文章: WeChat

I struggled to learn machine learning. I was used to variational tricks, MCMC samplers, and discreet Taylor expansions from years of physics training. Now the concepts were mixed up. The intuitive models of physical systems were replaced by abstract models of ‘data’ and amechanical patterns of cause and effect.

I had to fit these fields together. Physics and machine learning are intricately connected, but it is taking me years to make the overlaps precise. This process requires representing the new with the familiar, mapping jargon from one field to another.

A simple model of magnets—the Ising model—will help illustrate the rich connection between these fields. We first analyze this model with physics intuition. Then we derive the variational principle in physics and show that it recovers the same solution.

We then discover how that very same variational principle in physics opens a window into machine learning. We identify Boltzmann distributions as exponential families to make the mapping transparent, and show how approximate posterior inference is scaled to massive data thanks to the variational principle.

If you have a physics background, I hope you will have a better sense of machine learning and be able to read papers in the field. If you are a machine learner, I hope you will have the context to read a statistical physics paper about mean-field theory and the Ising model.

If this article is confusing, falls short of these goals, or could be improved in any way please email me, @ me, or submit a pull request.

The Ising model, a physics perspective

Consider a lattice of spins that point up or down:

[Figure: a lattice of spins, each pointing up or down.]

What features might make this a convincing model of magnetism?

Think about playing with magnets: if you put them close together, they pull each other closer. They repel each other when like poles face each other, and if they’re far apart, they don’t attract.

This means neighboring spins should affect each other in our model: if the spins around \(s_i\) point upward, it should also want to point upward.

Let’s refer to the spin at location \(i\) as \(s_i\). A spin can be in one of two states: a spin can point up (\(s_i=+1\)) or down (\(s_i=-1\)).

We can capture our intuition about spins being attracted to each other (they want to point in the same direction) or repulsed (they want to point in opposite directions) by introducing a parameter \(J\). This interaction parameter captures the interaction strength between spin \(i\) and spin \(j\).

If two neighboring spins point in the same direction, we’ll have them contribute a term \(-J\) to the total energy; if they point in opposing directions, they will contribute \(J\).

This lets us write the energy function, or Hamiltonian, of the system:

\[E(s_1, s_2,...,s_N) = -\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^NJ_{ij} s_i s_j\]

Here \(J_{ij} = J\) if spins \(i\) and \(j\) are neighbors, and \(J_{ij} = 0\) otherwise. The factor of \(\frac{1}{2}\) in front is to account for double counting from the sums over both \(i\) and \(j\). Note that the system has finitely many spins (\(N\) spins).

A spin configuration or state of the system is a specific setting of values for all spins. The set \(\{s_1=+1, s_2=+1, s_3=-1, ..., s_N=+1\}\) is an example of a configuration.

The second law of thermodynamics says that at a fixed temperature and entropy, a system will seek configurations that minimize its energy. This lets us reason about interactions.

If the interaction strength \(J\) is zero, the spins do not interact and the system has the same energy, zero, for all configurations (i.e. the energy is trivially minimized). But if the interaction strength \(J\) is positive, the spins will tend to align to minimize the energy of the system \(E(s_1, s_2,...,s_N)\). This corresponds to minimization because of the minus sign convention in front of the sum in the energy function.

Now let’s introduce a magnetic field \(H\). Imagine the lattice of spins immersed in a magnetic field, perhaps the ambient field from the earth’s crust. The magnetic field affects every spin independently, and each spin will try to align with the field. We can include the magnetic field in the energy of the system by summing the independent contributions for each spin:

\[E(s_1, s_2,...,s_N) = -\frac{1}{2}\sum_{i, j} J_{ij}s_i s_j - H\sum_i s_i\]

We can reason about the magnetic field strength \(H\) by imagining what happens if it is large or small (strong or weak). If \(H\) is large and the interactions between spins are weak, the magnetic field term will dominate and the spins will align with the magnetic field to minimize energy. But if the magnetic field is small, it is more difficult to reason about.
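To make the energy function concrete, here is a minimal Python sketch that evaluates it for a small square lattice. It assumes periodic boundary conditions and a lattice of at least 3 by 3 spins (so that every spin has four distinct neighbors); the name `ising_energy` and the list-of-rows layout are illustrative choices, not part of the model definition above.

```python
def ising_energy(spins, J=1.0, H=0.0):
    """Energy of a 2D Ising configuration (assumed periodic boundaries).

    spins: list of rows, each entry +1 or -1, at least 3 x 3.
    """
    n_rows, n_cols = len(spins), len(spins[0])
    interaction = 0.0
    for r in range(n_rows):
        for c in range(n_cols):
            s = spins[r][c]
            # Visit each neighboring pair once (right and down neighbors).
            # This plays the role of the 1/2 factor on the double sum.
            right = spins[r][(c + 1) % n_cols]
            down = spins[(r + 1) % n_rows][c]
            interaction += s * right + s * down
    field = sum(sum(row) for row in spins)
    return -J * interaction - H * field


all_up = [[1] * 3 for _ in range(3)]
one_flipped = [row[:] for row in all_up]
one_flipped[1][1] = -1

print(ising_energy(all_up))          # 18 aligned bonds -> -18.0
print(ising_energy(one_flipped))     # 4 bonds now anti-aligned -> -10.0
print(ising_energy(all_up, H=0.5))   # the field lowers the energy -> -22.5
```

Flipping a single spin raises the energy when \(J\) is positive, and a field pointing along the spins lowers it, which matches the reasoning above.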

Now that we have defined the Ising model and its characteristics, let’s think about our goals. What questions can we answer about this Ising model? For example, if we observe the system, what state will it be in—what are the most likely spin configurations? What is the average magnetization?

The Boltzmann distribution

Can we make our goals more precise and make math from words? To do this, we need to define a distribution over spin configurations. It is straightforward to derive the probability of finding the system in an equilibrium state 1:

\[p(s_1, s_2,...,s_N) = \frac{e^{-\beta E(s_1, s_2,...,s_N)}}{Z}\]

This is the Boltzmann distribution. The numerator is called the Boltzmann factor for a particular configuration. This factor gives high or low weight to a specific state of the system according to the energy for that state.

We query the Boltzmann distribution at a specific configuration of spins to get the probability of finding the system in this state.

For example, say the first spins in our configuration happen to be up, up, down, etc. We plug this in and get \(p(s_1=+1, s_2=+1, s_3=-1,...,s_N=+1)=0.7321\). This means this state was pretty likely.

This distribution behaves intuitively: low energy states are more probable than configurations with high energy. For example, if \(J=+1\), the spins will align, and the state where all spins point in the same direction is most probable. Why? Because it leads to the most negative energy function, which corresponds to the Boltzmann factor with the largest weight.

The parameter \(\beta\) is proportional to the inverse temperature, \(\beta = \frac{1}{k_BT}\) and is used for notational convenience. (Specifically, it includes the constant \(k_B\) to make the probability density dimensionless.) Temperature affects the model by controlling how important the interactions are. If \(T\rightarrow \infty\) we are at a high temperature, and the inverse temperature is small with \(\beta \ll 1\), so the interaction strength \(J\) is not important and has little effect. But at low temperatures, the inverse temperature is large, so interactions have a large effect on the system’s behavior.

The partition function

The denominator \(Z\) is of utmost importance. It ensures that the distribution sums to \(1\) and is thus a valid probability distribution. We need this normalization to calculate properties of the system. Calculating mean values and other moments can only be done with a probability mass function. The name of \(Z\) is “partition function” or “normalizing constant”. It is the sum of each state’s Boltzmann factor:

\[Z = \sum_{s_1=\pm1}\sum_{s_2=\pm1}...\sum_{s_N=\pm1}e^{-\beta E(s_1, s_2, ..., s_N)}\]

I explicitly wrote out the sum to illustrate why we can’t evaluate this distribution: we need to sum over all possible configurations. Each spin has two states, and there are \(N\) spins. This leads to \(2^N\) terms in the sum. A system with only a hundred spins already has about \(10^{30}\) terms, and with a few hundred spins the number of terms exceeds the number of atoms in the universe, so we can never hope to calculate it 2.
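For a tiny lattice the sum can still be done by brute force, which makes the exponential cost easy to see. The following is a sketch under stated assumptions: a 3 by 3 lattice with periodic boundaries, stored as a flat tuple, with hypothetical helper names `energy`, `partition_function`, and `probability`.

```python
import itertools
import math


def energy(spins, side, J, H):
    """Energy of a periodic side x side lattice stored as a flat tuple of +1/-1."""
    total = 0.0
    for r in range(side):
        for c in range(side):
            s = spins[r * side + c]
            right = spins[r * side + (c + 1) % side]
            down = spins[((r + 1) % side) * side + c]
            total -= J * s * (right + down) + H * s
    return total


def partition_function(side, J, H, beta):
    """Sum the Boltzmann factor over all 2**(side*side) configurations."""
    configs = itertools.product((-1, 1), repeat=side * side)
    return sum(math.exp(-beta * energy(s, side, J, H)) for s in configs)


def probability(spins, side, J, H, beta):
    """Boltzmann probability of one configuration."""
    weight = math.exp(-beta * energy(spins, side, J, H))
    return weight / partition_function(side, J, H, beta)


side, J, H, beta = 3, 1.0, 0.0, 0.5
all_up = (1,) * (side * side)
print(2 ** (side * side))   # number of terms for 9 spins
print(2 ** 100)             # number of terms for 100 spins
print(probability(all_up, side, J, H, beta))
```

Each extra spin doubles the number of terms, so this approach stops being practical after a few dozen spins.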

Using the Boltzmann distribution to calculate properties of the system

We arrived at a probability distribution describing which states of the system are likely, but we were stumped by the intractable partition function. Let’s temporarily assume we have infinite computation and can calculate the Boltzmann distribution’s partition function. What are some interesting things we can learn about the system from its Boltzmann distribution?

This distribution lets us calculate properties of the system as a whole by taking expectations (i.e. calculating observable quantities). For example, the magnetization \(m\) is the average magnetization over all spins:

\[m = \frac{1}{N} \langle s_1 + s_2 + ... + s_N \rangle = \langle s_i \rangle\]

Why should we care about this magnetization? It tells us about the system as a whole, the macrostate, rather than a specific microstate. We lose specificity because we can’t say anything about the first spin \(s_1\), but we learn about how it behaves across all possible states of the rest of the spins.

If the spins are aligned, the system is in an ordered state and the magnetization has a positive or negative sign. If the spins point in random directions, the system is disordered and the average magnetization is zero.

These are global phases of the system, and they depend on temperature. If the temperature \(T\) goes to infinity, the inverse temperature \(\beta\) goes to zero, and all states of the system are equally likely, as described by the Boltzmann distribution. But if the temperature is finite, then some states are more likely than others, and the system can transition between ordered and disordered phases. Such phase transitions and how they depend on the temperature are important for comparing how well this Ising model matches real-world materials 3.
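On a lattice small enough to enumerate, the magnetization can be computed exactly as an expectation. This sketch assumes a 3 by 3 periodic lattice and a small field \(H = 0.1\) to pick a direction (with \(H = 0\) a finite system has \(m = 0\) exactly, by the up/down symmetry of the energy). The function names are illustrative.

```python
import itertools
import math


def energy(spins, side, J, H):
    """Energy of a periodic side x side lattice stored as a flat tuple of +1/-1."""
    total = 0.0
    for r in range(side):
        for c in range(side):
            s = spins[r * side + c]
            right = spins[r * side + (c + 1) % side]
            down = spins[((r + 1) % side) * side + c]
            total -= J * s * (right + down) + H * s
    return total


def magnetization(side, J, H, beta):
    """Exact expectation of (1/N) * sum_i s_i under the Boltzmann distribution."""
    n = side * side
    Z = 0.0
    weighted = 0.0
    for spins in itertools.product((-1, 1), repeat=n):
        w = math.exp(-beta * energy(spins, side, J, H))
        Z += w
        weighted += w * sum(spins) / n
    return weighted / Z


# beta = 0 is the infinite-temperature limit: all states are equally likely.
for beta in (0.0, 0.1, 0.5, 1.0):
    print(beta, magnetization(3, J=1.0, H=0.1, beta=beta))
```

The printed magnetization grows as \(\beta\) increases (temperature decreases), while at \(\beta = 0\) it vanishes because every configuration has the same weight.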

Let’s remember that we can’t evaluate the partition function \(Z\). This situation seems hopeless for answering interesting questions like calculating the magnetization. But thankfully, we may be able to simplify the problem by considering each spin independently and figuring out an approximation…

Mean-field theory in physics

Because we cannot evaluate the intractable sum required to calculate the partition function, we turn to mean-field theory.

This is an approximation technique that can still let us answer questions about the system such as the average magnetization. We will study the dependence of the magnetization \(m\) on temperature.

To demonstrate the technique, it is easiest to focus on a single spin:

[Figure: the first spin \(s_1\) of the Ising model, with nearest neighbors \(s_2\) through \(s_5\), in a magnetic field \(H\). The magnetic field is shown with dashed lines. The nearest neighbors provide an effective field through the interactions, denoted by lines connecting the spins.]

The contribution of this single spin to the total energy of the system is simply the corresponding term in the energy:

\[E_{s_1} = -s_1\left(J\sum_{j=2}^{z+1} s_j + H\right)\]

The sum is over the \(z\) nearest neighbors. For the two-dimensional lattice we are considering, \(z = 4\). We can rewrite this energy for a single spin in terms of the fluctuations of a spin \(s_j\) around its mean value \(m = \langle s_j \rangle\). Replacing \(s_j = m + (s_j - m)\) gives

\[E_{s_1}= -s_1(zJm + H) -J s_1 \sum_{j=2}^{z+1} (s_j - m)\]

The next step is crucial: we will ignore the fluctuations of neighboring spins around their mean value. In other words, we assume that the term \((s_j - m) \rightarrow 0\), so that each of the neighbors of \(s_1\) is simply equal to its mean value, \(s_j = m\).

When is this true?

When the fluctuations around the mean value are small, such as at low temperature ‘ordered’ phases. This assumption greatly simplifies the Hamiltonian for the spin:

\[E_{s_1}^{MF} = -s_1 (zJm + H).\]

This is the mean-field energy function for a single spin. It is equivalent to a non-interacting spin in an effective magnetic field, \(H^{eff}=zJm + H\).

Why do we say this spin is non-interacting? The energy function for the spin only depends on its state, \(s_1\), and does not depend on the state of any other spins. We have approximated the interaction effects by the average magnetic field induced by the neighboring spins; this is the mean field.

In this mean-field model, each spin feels the effects of the magnetic field applied to the entire system, \(H\), as well as the ‘effective’ mean field from its neighboring spins \(zJm\).

We clarify this interpretation by writing

\[H\leftarrow H + \Delta H\]

where \(\Delta H = zJm\) is the average magnetic field (‘mean field’) of the neighbors of each spin.

By ignoring the fluctuations of each spin, we have reduced the complexity of the problem. Instead of \(N\) interacting spins, we have \(N\) independent spins in a uniform magnetic field \(H\) with a small correction \(\Delta H\) to account for the effects of interactions.

We write the energy function for the mean-field model as

\[E_{MF}(s_1, s_2,...,s_{N}) = -(H+\Delta H)\sum_{i=1}^Ns_i.\]

This shows that there are no interaction terms anymore (the term \(s_is_j\) doesn’t occur in the energy).

In other words, we can treat each spin independently, then combine the results appropriately to model the entire system!

We have radically changed the nature of the problem.

Instead of computing the partition function \(Z\) for the whole system, we can now compute it for a single spin.

This is straightforward and has an analytic solution 4:

\[Z_{s_1} = \sum_{s_1 = \pm 1} e^{\beta s_1(H+\Delta H)}\] \[\Rightarrow Z_{s_1} = 2\cosh{[\beta(H+\Delta H)]}\]

The partition function for the entire mean-field model with \(N\) spins is then

\[Z_{MF} = (2\cosh{[\beta(H+\Delta H)]} )^N.\]
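As a quick sanity check, the closed form can be compared against a brute-force sum over all configurations of a few independent spins. The parameter values below are arbitrary and the function names are hypothetical.

```python
import itertools
import math


def z_mf_brute_force(n, beta, H, dH):
    """Sum exp(-beta * E_MF) over all 2**n configurations of independent spins."""
    total = 0.0
    for spins in itertools.product((-1, 1), repeat=n):
        e_mf = -(H + dH) * sum(spins)
        total += math.exp(-beta * e_mf)
    return total


def z_mf_closed_form(n, beta, H, dH):
    """(2 cosh[beta (H + dH)])**n, the product of n single-spin partition functions."""
    return (2.0 * math.cosh(beta * (H + dH))) ** n


n, beta, H, dH = 10, 0.7, 0.2, 1.3
print(z_mf_brute_force(n, beta, H, dH))
print(z_mf_closed_form(n, beta, H, dH))
```

The two numbers agree up to floating-point error, but the closed form costs a single `cosh` evaluation instead of \(2^N\) terms.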

With the partition function in hand, we can get the Boltzmann distribution and answer questions about the system such as magnetization.

We get the magnetization by taking an expectation over the distribution for the spin. The last step is to require that for any spin \(i\), its average magnetization should equal the magnetization of the system as a whole:

\[m = \sum_{s_i=\pm 1} p(s_i) s_i\] \[\Rightarrow m =\tanh{[\beta({H + \Delta H})]}\]

This gives us a clean equation for the magnetization,

\[m = \tanh{[\beta(H + zJm)]},\]

where we used that the mean-field parameter is \(\Delta H = zJm\).

This is a formula for the magnetization \(m\) as a function of temperature. It has no closed form solution, but we can plot both sides of the equation and see where they intersect to find the implicit solutions.

First let’s think about the case when there is no external field, \(H = 0\).

For high temperatures, the equation only has one solution: \(m = 0\). This aligns with our intuition—if we look at the energy of the system, the inverse temperature \(\beta\) goes to zero and all states of the spins are equally likely. They average out to zero.

For low temperatures, we see three solutions: \(m=0\) and \(m= \pm \lvert m \rvert\). The additional \(\pm\) solutions appear when the slope of the \(\tanh\) function at the origin is greater than one:

\[\frac{d}{dm} \tanh{[\beta zJm]} |_{m=0} > 1\] \[\Rightarrow \beta zJ > 1\]

The “critical temperature” at which the phase transition occurs is when \(\beta zJ = \frac{1}{k_B T} zJ = 1\), or when \(k_B T_c = zJ\).
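The self-consistency equation \(m = \tanh{[\beta(H + zJm)]}\) can also be solved numerically. One simple option, sketched here, is fixed-point iteration starting from a positive guess. The units are an assumption (\(k_B = 1\), so \(\beta = 1/T\) and \(T_c = zJ\)), and the function name and defaults are illustrative.

```python
import math


def mean_field_magnetization(T, J=1.0, z=4, H=0.0, m0=0.5,
                             tol=1e-12, max_iter=100000):
    """Iterate m <- tanh(beta * (H + z*J*m)) until it stops changing.

    Assumes units with k_B = 1. Starting from m0 > 0 selects the
    non-negative solution when several solutions exist.
    """
    beta = 1.0 / T
    m = m0
    for _ in range(max_iter):
        m_new = math.tanh(beta * (H + z * J * m))
        if abs(m_new - m) < tol:
            return m_new
        m = m_new
    return m


# With z = 4 and J = 1 the predicted critical temperature is T_c = 4.
for T in (2.0, 3.0, 3.9, 4.1, 5.0):
    print(T, mean_field_magnetization(T))
```

Below \(T_c\) the iteration settles on a nonzero magnetization, and above \(T_c\) it decays to zero. Convergence becomes slow close to \(T_c\), so this sketch is not a good way to locate the transition precisely.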

This gives us a testable prediction: we can take a magnetic material, and measure what temperature its phase transition occurs at.

Have we accomplished our goal?

We set out to understand the behavior of this model at various temperatures, in terms of global properties like the magnetization.

By considering a single spin and approximating the effects of other spins as an effective magnetic field, we were able to reduce the complexity of the problem. This allowed us to study phase transitions. However, our exposition felt a little hand-wavy, so let’s dive into a rigorous foundation to justify our intuitions.

Deriving the variational free energy principle: the Gibbs-Bogoliubov-Feynman inequality

Can we learn what tradeoffs we make when we make the assumption of ‘ignoring fluctuations’ of spins around their mean values? Specifically, how can we gauge the quality of results derived from our mean-field theory?

We can rederive the mean-field results in the previous section by directly attacking the problem of the intractable partition function. We can try to approximate this partition function with a simpler one.

Recall that the partition function \(Z\) for the system is

\[Z = \sum_{s_1, s_2, ..., s_N}e^{-\beta E(s_1, s_2,...,s_N)}\]

where as before, the energy for the system is

\[E(s_1, s_2,...,s_N) = -\frac{1}{2}\sum_{i, j} J_{ij}s_i s_j - H\sum_i s_i.\]

The complexity of computing the partition function comes from the interaction term with \(s_is_j\). We saw that without this term, we were able to reduce the problem to dealing with a system of independent spins.

To derive the variational principle, we will therefore assume an energy function of the form

\[E_{MF}(s_1, s_2,...,s_N) = -(H + \Delta H) \sum_{i=1}^N s_i\]

Previously we saw that the mean-field parameter is \(\Delta H = zJm\) which we derived using our physics intuition.

Now we ask the question: is this the optimal effective magnetic field? We can think of \(\Delta H\) as a parameter of the mean-field model that we can tune to get the best answers for the original system.

This is known as perturbation theory: we are perturbing the magnetic field of the system and trying to find the optimal perturbation that yields a good approximation to the original system.

What does a ‘good approximation’ entail? Our difficulties were in computing the partition function. We therefore want to approximate the partition function of the original system \(Z\) with the partition function of our mean-field system \(Z_{MF}\). Let’s hope that \(Z_{MF}\) is easy to calculate and does not require a sum on the order of the number of atoms in the universe.

First let’s see if we can express the partition function of the original system \(Z\) in terms of our approximation. We can measure how the energy of the mean-field system deviates from the reference system by computing the fluctuations in energy:

\[\Delta E(s_1, s_2,...,s_N) = E(s_1, s_2,...,s_N) - E_{MF}(s_1, s_2,...,s_N)\] \[\Rightarrow E = E_{MF} + \Delta E\]

This lets us reëxpress the original partition function as:

\[Z = \sum_{s_1, s_2,...,s_N} \exp{[-\beta(E_{MF} + \Delta E)]}\] \[\Rightarrow Z =Z_{MF} \sum_{s_1, s_2,...,s_N}\frac{\exp{(-\beta E_{MF})} \exp{(-\beta\Delta E)}}{Z_{MF}}\]

For the next step, we need the definition of an expectation of a function \(A\) with respect to the mean-field Boltzmann distribution:

\[\langle A \rangle_{MF} =\sum_{s_1, s_2,...,s_N} \frac{A e^{-\beta E_{MF}}}{Z_{MF}}\]

This means we can write the partition function of the system in terms of the mean-field partition function as:

\[Z=Z_{MF}\langle \exp{(-\beta \Delta E)}\rangle_{MF}\]

This is an exact factorization of the partition function of the original system. It is the mean-field partition function weighted by the expected Boltzmann factor for energy fluctuations away from the reference system.

However, integrating this complicated exponential function is difficult, even with respect to the mean-field system. We’ll simplify it with a classic physics trick—by pulling a Taylor expansion.

Let’s assume that the fluctuations of the energy are small compared to the thermal energy; \(\beta \Delta E \ll 1\). Then we can Taylor expand the exponential:

\[\langle \exp{(-\beta \Delta E)}\rangle_{MF}~\approx~\langle 1 - \beta \Delta E + ... \rangle_{MF}\] \[=~1 - \beta \langle \Delta E\rangle_{MF}+...\] \[=\exp{(-\beta \langle \Delta E\rangle_{MF})} + ...\]

We have neglected terms of second order in the fluctuations \(\Delta E\). This gives us our first-order perturbation theory result for the partition function of the original system:

\[Z \approx Z_{MF}\exp{(-\beta \langle \Delta E\rangle_{MF})}\] \[\Rightarrow Z \approx Z_{MF}\exp{(-\beta \langle E - E_{MF}\rangle_{MF})}\]

How good is the approximation? We need a simple identity 5: \(e^x \geq x + 1\).

Let’s apply this to the expectation in the exact factorization of the partition function, taking \(f = -\beta \Delta E\):

\[\langle e^f\rangle = e^{\langle f\rangle} \langle e^{(f - \langle f \rangle)} \rangle\] \[\geq e^{ \langle f \rangle} \langle 1 + f - \langle f \rangle\rangle = e^{\langle f \rangle}\]

Now we can get a lower bound on the partition function:

\[Z = Z_{MF}\langle \exp{(-\beta \Delta E)}\rangle_{MF}\] \[\Rightarrow Z \geq Z_{MF} \exp{[-\beta \langle E - E_{MF}\rangle_{MF}]}\]

This inequality is the Gibbs-Bogoliubov-Feynman inequality. It tells us that with our mean-field approximation, we get a lower bound on the original partition function.
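The inequality can be checked numerically on a system small enough to enumerate. This sketch assumes a 3 by 3 periodic lattice and arbitrary parameter values; both sides are compared on a log scale, and the mean-field expectation is computed by brute force rather than by using independence of the spins.

```python
import itertools
import math

SIDE, J, H, BETA = 3, 1.0, 0.1, 0.3
N = SIDE * SIDE
STATES = list(itertools.product((-1, 1), repeat=N))


def energy(spins):
    """Energy of the interacting model on a periodic SIDE x SIDE lattice."""
    total = 0.0
    for r in range(SIDE):
        for c in range(SIDE):
            s = spins[r * SIDE + c]
            right = spins[r * SIDE + (c + 1) % SIDE]
            down = spins[((r + 1) % SIDE) * SIDE + c]
            total -= J * s * (right + down) + H * s
    return total


def energy_mf(spins, dH):
    """Mean-field energy: independent spins in the field H + dH."""
    return -(H + dH) * sum(spins)


LOG_Z = math.log(sum(math.exp(-BETA * energy(s)) for s in STATES))


def log_lower_bound(dH):
    """log of Z_MF * exp(-beta * <E - E_MF>_MF)."""
    weights = [math.exp(-BETA * energy_mf(s, dH)) for s in STATES]
    z_mf = sum(weights)
    mean_delta_e = sum(
        w * (energy(s) - energy_mf(s, dH)) for s, w in zip(STATES, weights)
    ) / z_mf
    return math.log(z_mf) - BETA * mean_delta_e


for dH in (-1.0, 0.0, 0.5, 1.0, 2.0):
    print(dH, log_lower_bound(dH), "<=", LOG_Z)
```

Every choice of \(\Delta H\) gives a value below \(\log Z\); some choices are tighter than others, which is what motivates tuning \(\Delta H\).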

Variational treatment of the Ising model using the Gibbs-Bogoliubov-Feynman inequality

Let’s apply this theory: do we recover the same results for magnetization in the Ising model?

In the mean-field Ising model, we treat each spin independently, so the energy function of the system decomposes into independent parts:

\[E_{MF}(s_1, s_2,...,s_N) = -(H + \Delta H) \sum_{i=1}^N s_i\]

Here \(\Delta H\) is the effective magnetic field strength. It is a parameter we can tune to maximize the lower bound on the partition function.

Let’s plug this into the lower bound on the partition function from the Gibbs-Bogoliubov-Feynman inequality. Then we take the derivative to maximize the lower bound:

\[0 = \frac{\partial}{\partial{\Delta H}} Z_{MF} \exp{[-\beta \langle E - E_{MF}\rangle_{MF}]}\]

First, we need to evaluate the expectation:

\[\langle E - E_{MF}\rangle_{MF} = -N(\frac{1}{2} Jz\langle s_1\rangle^2_{MF} - \Delta H \langle s_1 \rangle_{MF}),\]

where we used the mean-field assumption that the spins are independent, hence \(\langle s_i s_j\rangle_{MF} = \langle s_i\rangle_{MF} \langle s_j\rangle_{MF}\).

We also assumed that for a large enough system, spins at the edges of the model (boundary conditions) can be ignored, so all spins have the same average magnetization: \(\langle s_i\rangle_{MF}\langle s_j\rangle_{MF} = \langle s_1\rangle_{MF}^2\).

Plugging this in to the lower bound on the partition function and differentiating gives

\[0 = \tanh[\beta(H + \Delta H)] - \langle s_1\rangle_{MF} + Jz\langle s_1\rangle_{MF} \frac{\partial}{\partial \Delta H} \langle s_1\rangle_{MF} - \Delta H \frac{\partial}{\partial \Delta H} \langle s_1\rangle_{MF}\] \[\Rightarrow \Delta H = Jz \langle s_1\rangle_{MF}.\]

We used that \(m = \langle s_1\rangle_{MF} = \tanh{[\beta({H + \Delta H})]}\) from before.

This confirms our earlier reasoning, that the optimal mean-field parameter is \(\Delta H = Jzm\). There were three steps to this process. We started by defining the model we cared about, we wrote down a mean-field approximation to it, and we maximized a lower bound on the partition function.
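
The three steps can be checked numerically. The sketch below is only an illustration: the function names and parameter values are made up, and it assumes \(Z_{MF} = (2\cosh[\beta(H + \Delta H)])^N\) for independent spins. It solves the self-consistency condition \(m = \tanh[\beta(H + Jzm)]\) by fixed-point iteration, then checks that the log of the lower bound (per spin) is larger at \(\Delta H = Jzm\) than at nearby values.

```python
import math


def mean_field_magnetization(beta, J, z, H, m0=0.5, tol=1e-12, max_iter=10_000):
    """Iterate m <- tanh(beta * (H + J * z * m)) until it stops changing."""
    m = m0
    for _ in range(max_iter):
        m_new = math.tanh(beta * (H + J * z * m))
        if abs(m_new - m) < tol:
            return m_new
        m = m_new
    return m


def log_bound_per_spin(delta_H, beta, J, z, H):
    """Log of the Gibbs-Bogoliubov-Feynman lower bound, divided by N."""
    field = beta * (H + delta_H)
    m = math.tanh(field)                          # <s_1>_MF
    log_z_mf = math.log(2.0 * math.cosh(field))   # (log Z_MF) / N
    # -beta * <E - E_MF>_MF / N = beta * (J z m^2 / 2 - delta_H * m)
    return log_z_mf + beta * (0.5 * J * z * m * m - delta_H * m)


beta, J, z, H = 0.5, 1.0, 4, 0.1  # illustrative values only
m_star = mean_field_magnetization(beta, J, z, H)
delta_H_star = J * z * m_star

for delta_H in (delta_H_star - 0.5, delta_H_star, delta_H_star + 0.5):
    print(f"delta_H={delta_H:.3f}  log bound per spin="
          f"{log_bound_per_spin(delta_H, beta, J, z, H):.5f}")
```

The middle value printed should be the largest of the three. Fixed-point iteration is just one convenient way to solve the self-consistency equation.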

The machine learning perspective on the Ising model

Now let’s frame what we just did in the language of machine learning. More specifically, let’s think in terms of probabilistic modeling.

We need some definitions to see how the variational principle is equivalent to variational inference in machine learning.

The Ising model is an undirected graphical model or Markov random field. We can represent the conditional dependencies of the model using a graph; the nodes in the graph are random variables. These random variables are the spins of the Ising model, so two nodes are connected by an edge if they interact. This lets us encode the joint distribution of the random variables in the following diagram:

A representation of the Ising model as an undirected graphical model. The nodes are random variables (spins) and edges denote conditional dependencies between their distributions.

The Boltzmann distribution is a parameterization of the joint distribution of this graphical model. This figure looks very similar to the physics spin-based representation—the spins are random variables. We can also write the joint distribution of the nodes in exponential family form. Exponential family distributions let us reason about a broad class of models and deserve a header.

Exponential families

A way to parameterize probability distributions like the Ising model is with exponential families. These are families of distributions that support a representation in this specific, convenient mathematical form:

\[p(x ; \eta) = h(x)e^{\eta^\top t(x) - a(\eta)}\]

Here \(\eta\) is called the natural parameter, \(h(x)\) is the base measure, \(t(x)\) the sufficient statistic and \(a(\eta)\) is the log normalizer, or log partition function. I was confused about exponential families for a long time and found a concrete derivation helpful.

For example, we are used to seeing the Bernoulli distribution in the following form 6:

\[p(x ; \pi) = \pi^x(1-\pi)^{(1-x)}\]

We can rewrite this in exponential family form:

\[p(x; \eta) = \exp{\{x\log \pi + (1-x)\log{(1-\pi)}\}}\] \[\Rightarrow p(x; \eta)=\exp{\{x\log \frac{\pi}{1-\pi} + \log{(1-\pi)}\}}\]

Comparing to the above formula for exponential families reveals the natural parameter, sufficient statistic, log normalizer, and base measure for the Bernoulli, given by \(\eta = \log{\frac{\pi}{1-\pi}}\), \(t(x) = x\), \(a(\eta) = -\log{(1-\pi)} = \log{(1+e^\eta)}\), and \(h(x) = 1\) respectively.

More connections to physics: the log normalizer is the log of the partition function. This is made clear in the exponential family form of the Bernoulli: \(\log Z = \log \sum_{x\in\{0,1\}} e^{\eta x} = \log{(1+e^\eta)}\). We can now identify the parameter \(\eta\) as analogous to temperature, with \(x\) as a spin. We’ve identified the Ising model’s exponential family form!
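
A quick numerical check of the Bernoulli rewrite, as a minimal sketch. The function names and the value of \(\pi\) are chosen here for illustration. It compares the familiar form with the exponential family form and confirms that the log normalizer equals \(-\log(1-\pi)\).

```python
import math


def bernoulli_pmf(x, pi):
    """Familiar form: pi^x * (1 - pi)^(1 - x)."""
    return pi**x * (1.0 - pi) ** (1 - x)


def bernoulli_expfam(x, eta):
    """Exponential family form with h(x) = 1, t(x) = x, a(eta) = log(1 + e^eta)."""
    log_normalizer = math.log(1.0 + math.exp(eta))
    return math.exp(eta * x - log_normalizer)


pi = 0.3                          # illustrative value
eta = math.log(pi / (1.0 - pi))   # natural parameter (log-odds)

for x in (0, 1):
    print(x, bernoulli_pmf(x, pi), bernoulli_expfam(x, eta))
```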

The exponential family form of the Ising model

Let’s connect this to the energy function of the Ising model by writing its Boltzmann distribution in exponential family form:

\[p(s_1, s_2,...,s_N; \beta, J, H) = \frac{e^{-\beta E(s_1, ..., s_N)}}{Z}\] \[p(s_1, s_2,...,s_N; \beta, J, H) = \exp{\{\sum_{(i, j)\in E}\frac{1}{2}\beta Js_is_j + \sum_{i \in V}\beta Hs_i - \log{Z}\}}\] \[p(s_1, s_2,...,s_N; \theta)=\exp{\{ \sum_{(i, j)\in E} \theta_{ij}s_is_j + \sum_{i \in V} \theta_i s_i - a(\theta)\}}\]

We have introduced some new notation common to graphical models: we have specified a joint distribution over a collection of random variables \(\{s_1, ..., s_N\}\) that live on the graph over vertices \(V\), joined by edges in the set \(E\).

This is the exponential family form of the Ising model, a probability model with model parameters \(\theta\). To equate it to the form we saw earlier, set \(\theta_{ij} = \frac{1}{2}\beta J\) if \(i\) and \(j\) share an edge (i.e. they are neighbors; the factor of \(\frac{1}{2}\) compensates for the sum visiting each neighboring pair in both orders), and set \(\theta_i = \beta H\). The signs are positive because the energy enters the Boltzmann distribution as \(-\beta E\), and the log normalizer is \(a(\theta) = \log Z\).

For the Ising model, we can see that there are two sets of model parameters. The spin-spin interaction parameter multiplied by the inverse temperature \(\beta J\) controls the effects of each edge in the graph. The inverse temperature multiplied by the magnetic field \(\beta H\) affects each spin independently. We can also say that the inverse temperature \(\beta\) is a global model parameter. For a fixed interaction and magnetic field, we can vary the temperature to index a specific model.

This is a subtle but important point. Our joint distribution over the set of random variables (the \(N\) spins) is indexed by the set of model parameters. By varying the inverse temperature parameter \(\beta\), we are actually selecting a specific model (the Ising model at that temperature). Ditto for a specific choice of the spin-spin interaction parameter \(J\).

What questions can we ask about the model?

Computing the magnetization \(m = \frac{1}{N}\langle s_1 + ... + s_N \rangle = \langle s_i \rangle\) means calculating the expectation \(\mathbb{E}_{p(s_i)}[s_i]\). In probability language, this means calculating the marginal expectation of a node \(i\).

But calculating the marginal distribution is intractable for reasons we already discussed: it requires marginalizing over all other nodes \(j \neq i\):

\[p(s_i) = \sum_{s_1=\pm 1} ... \sum_{s_{i-1}=\pm 1}\sum_{s_{i+1}=\pm 1}...\sum_{s_N=\pm1} p(s_1,...,s_{i-1}, s_i, s_{i+1}, ..., s_N)\]

The situation is hopeless: not only do we need to calculate the normalizing constant for the joint distribution of \(N\) nodes, which has \(2^N\) terms, but then we need to marginalize over \(N-1\) variables (another \(2^{N-1}\) terms).

This is identical to what we saw in the partition function, when thinking about this model from a physics perspective.
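
To see the cost concretely, here is a brute-force sketch for a tiny system. The ring geometry, the function names, and the parameter values are assumptions made for this example (each nearest-neighbor bond is counted once, which equals the \(\frac{1}{2}J\) convention with every pair visited twice). The loop visits all \(2^N\) configurations, so it is only feasible for small \(N\).

```python
import itertools
import math


def ring_energy(spins, J, H):
    """Energy of spins on a ring; each nearest-neighbor bond is counted once."""
    n = len(spins)
    bonds = sum(spins[i] * spins[(i + 1) % n] for i in range(n))
    return -J * bonds - H * sum(spins)


def exact_magnetization(n, beta, J, H):
    """Return (<s_1>, Z) by summing over all 2**n spin configurations."""
    z_total = 0.0
    weighted_spin = 0.0
    for spins in itertools.product((-1, 1), repeat=n):
        weight = math.exp(-beta * ring_energy(spins, J, H))
        z_total += weight
        weighted_spin += spins[0] * weight
    return weighted_spin / z_total, z_total


for n in (4, 8, 12):  # illustrative sizes; n = 12 already needs 4096 terms
    m, _ = exact_magnetization(n, beta=0.5, J=1.0, H=0.1)
    print(f"N={n:2d}  configurations={2**n:5d}  <s_1>={m:.4f}")
```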

Can we still answer questions about the marginal distributions by resorting to a variational principle?

Variational inference in machine learning

If we could calculate the sum over all configurations of random variables, we could calculate the partition function. But we can’t, because the sum grows as \(2^N\).

With our physics hat on, our strategy was to approximate the partition function.

From a machine learning perspective, this technique is known as variational inference. We vary something simple to infer something complicated.

Let’s look at how the variational free energy is derived in machine learning and used to approximate partition functions.

We have a probability model of random variables \(p_\theta(s_1, ..., s_N)\) and we seek to calculate its normalizing constant or partition function 7.

Let’s construct a simpler probability distribution \(q_\lambda(s_1, ..., s_N)\), parameterized by \(\lambda\), and use it to approximate our model.

How good is our approximation? One way of measuring how close our approximation is to our goal distribution is with the Kullback-Leibler divergence.

This divergence between \(q\) and \(p\), or relative entropy, measures the amount of information (in bits or nats) that is lost when using \(q\) to approximate \(p\).

This gives us a criterion with which to vary our approximation. We vary the \(\lambda\) parameter of our approximation until we minimize the approximation error, as measured by the Kullback-Leibler divergence.

The KL divergence is written with a double vertical bar as

\[\textrm{KL}(q(s) \mid\mid p(s)) = \int q(s) \log \frac{q(s)}{p(s)}ds\]

Let’s assume we are dealing with an exponential family distribution such as the Ising model. We let \(p\) be the Boltzmann distribution for our model with the known energy function \(E(s_1, ..., s_N)\):

\[p(s) = \frac{e^{-\beta E(s)}}{Z}\]

We assume that \(q\) is a family of distributions with another energy function that has parameters \(\lambda\):

\[q_\lambda(s) = \frac{e^{-\beta E_\lambda(s)}}{Z_q}\]

To measure how much information we lose when we use our approximation \(q\) instead of \(p\), we plug them into the Kullback-Leibler divergence:

\[\textrm{KL}(q_\lambda(s) \mid\mid p(s)) = \int q_\lambda(s) \log q_\lambda(s)\,ds - \int q_\lambda(s) \log \exp{(-\beta E(s))}\,ds + \log Z\] \[= \mathbb{E}_{q_\lambda} [\log q_\lambda(s)] - \mathbb{E}_{q_\lambda}[-\beta E(s)] + \log Z\] \[= -\mathcal{L}(\lambda) + \log Z\]

where we have defined the variational lower bound \(\mathcal{L}(\lambda)\) as

\[\mathcal{L}(\lambda) = \mathbb{E}_{q_\lambda}[-\beta E(s)] - \mathbb{E}_{q_\lambda}[\log q_\lambda(s) ]\]

We can move the variational lower bound to the other side of the equation to get the following identity:

\[\log Z = \textrm{KL}(q \mid\mid p) + \mathcal{L}(\lambda)\]

With Jensen’s inequality it is easy to show that the KL divergence is always greater than or equal to zero. Because \(\log Z\) does not depend on \(\lambda\), making \(\mathcal{L}(\lambda)\) bigger must make the KL divergence smaller (i.e. our approximation must improve). Dropping the non-negative KL term, we can lower bound the log partition function:

\[\log Z \geq \mathcal{L}(\lambda)\]

This means we can vary the parameters \(\lambda\) of our approximation to improve the lower bound, and get a better and better approximation to the partition function!

Note that in the definition of the variational lower bound, we do not need to worry about the arduous task of calculating the partition function: it does not depend on \(\lambda\).

This is awesome: we have constructed an approximation \(q_\lambda\) to our probability model \(p\) and found a way to vary its parameters so that our approximation gets better and better.

The interesting part is that we can improve the approximation to our model \(p\) without calculating its intractable partition function. We only need to evaluate its energy function \(E(s)\) which is cheap to compute.

Is this too clever to be true? Have we surrendered anything? We have lost the ability to measure how good our approximation is, in absolute terms—for that, we still need to calculate the partition function to compute the KL divergence. We do know that as long as our lower bound \(\mathcal{L}(\lambda)\) increases as we vary \(\lambda\), our approximation gets better, and this is sufficient for a variety of problems.
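
For a system small enough to enumerate, both the bound and the identity \(\log Z = \textrm{KL}(q \mid\mid p) + \mathcal{L}(\lambda)\) can be verified directly. The sketch below rests on assumptions chosen for illustration: a ring of six spins with each bond counted once, and a factorized \(q_\lambda\) with energy \(E_\lambda(s) = -\lambda \sum_i s_i\). In a real problem the enumeration is exactly what we cannot afford, so this is a sanity check rather than a method.

```python
import itertools
import math


def ring_energy(spins, J, H):
    """Energy of spins on a ring; each nearest-neighbor bond is counted once."""
    n = len(spins)
    bonds = sum(spins[i] * spins[(i + 1) % n] for i in range(n))
    return -J * bonds - H * sum(spins)


def log_partition(n, beta, J, H):
    """Exact log Z by summing over all 2**n configurations."""
    return math.log(sum(math.exp(-beta * ring_energy(s, J, H))
                        for s in itertools.product((-1, 1), repeat=n)))


def log_q(spins, beta, lam):
    """Log-probability under a factorized q where every spin feels a field lam."""
    log_norm = math.log(2.0 * math.cosh(beta * lam))
    return sum(beta * lam * s - log_norm for s in spins)


def bound_and_kl(n, beta, J, H, lam):
    """Return (L(lam), KL(q || p)), both computed by brute-force enumeration."""
    log_z = log_partition(n, beta, J, H)
    bound, kl = 0.0, 0.0
    for s in itertools.product((-1, 1), repeat=n):
        lq = log_q(s, beta, lam)
        neg_beta_e = -beta * ring_energy(s, J, H)
        bound += math.exp(lq) * (neg_beta_e - lq)        # E_q[-beta E] - E_q[log q]
        kl += math.exp(lq) * (lq - (neg_beta_e - log_z))  # E_q[log q - log p]
    return bound, kl


n, beta, J, H = 6, 0.5, 1.0, 0.1  # illustrative values only
log_z = log_partition(n, beta, J, H)
for lam in (0.0, 0.5, 1.0, 2.0):
    bound, kl = bound_and_kl(n, beta, J, H, lam)
    print(f"lam={lam:.1f}  L={bound:.4f}  KL={kl:.4f}  log Z={log_z:.4f}")
```

Every row should satisfy \(\mathcal{L}(\lambda) \leq \log Z\), with the gap equal to the KL divergence.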

Variational inference as the Gibbs-Bogoliubov-Feynman inequality!

Let’s see if this is the same as the Gibbs-Bogoliubov-Feynman inequality we saw in physics. Recall that the inequality is

\[Z \geq Z_{MF} \exp{[-\beta \langle E - E_{MF}\rangle_{MF}]}.\]

Taking logarithms:

\[\log Z \geq - \langle \beta E\rangle_{MF} + \langle \beta E_{MF}\rangle_{MF} + \log Z_{MF}\] \[\Rightarrow \log Z \geq \mathbb{E}_{q_\lambda}[-\beta E(s)] - \mathbb{E}_{q_\lambda} [\log q_\lambda(s) ]\] \[\Rightarrow \log Z \geq \mathcal{L}(\lambda)\]

Here we have identified the variational family as the mean-field Boltzmann distribution \(q_\lambda(s) = \frac{\exp(-\beta E_{MF}(s))}{Z_{MF}}\), which factorizes into a product over the independent spins. Again, \(\lambda\) denotes the variational parameters that we vary to maximize the lower bound 8.

This shows that variational inference in machine learning—maximizing a lower bound on the partition function—is exactly the Gibbs-Bogoliubov-Feynman inequality in action.

The evidence lower bound in approximate posterior inference

In machine learning we care about patterns in data. This gives rise to the concept of latent variables, unobserved random variables that capture patterns in observed data.

For example, in linear regression we might posit a linear relationship between someone’s age and their income. The scalar regression coefficient captures a latent pattern that we seek to infer from many examples of (age, income) tuples.

We refer to a probability model as a model of latent variables \(z\) and data \(x\). The posterior distribution of latent variables given observed data is written \(p(z \mid x)\).

What is a posterior? In our regression example of the relationship between age and income, we want the posterior distribution of the regression coefficient after observing data. Our choice of prior on the coefficient is a modeling decision and reflects our belief about the statistical relationship we hope to observe.

The posterior is given by Bayes’ rule:

\[p(z \mid x) = \frac{p(x \mid z) p(z)}{\int p(x, z) dz}\]

The denominator is the evidence, the marginal distribution of the data: \(p(x) = \int p(x, z) dz\). This is the normalizer of the joint distribution of latent variables and data, or the partition function. This partition function is a sum over all configurations of random variables, and is intractable as we saw twice before.

Can we still do posterior inference despite the intractable partition function?

The refrain is familiar: we have an intractable sum in our partition function, but we can approximate it using the tools we developed earlier! Variational inference to the rescue. Let’s write out the variational lower bound on the partition function:

\[\log Z = \log p(x) \geq \mathcal{L}(\lambda) = \mathbb{E}_{q_\lambda}[\log p(x, z)] - \mathbb{E}_{q_\lambda}[\log q_\lambda (z)]\]

Again, by varying the parameters \(\lambda\) we can learn a good approximate posterior distribution \(q_\lambda(z)\) to approximate the posterior we care about but can’t calculate, \(p(z \mid x)\).

If we are using the variational method to learn an approximate posterior, our log partition function is the log evidence \(\log p(x)\). We thus refer to the variational lower bound \(\mathcal{L}(\lambda)\) as the Evidence Lower Bound or ELBO and speak of maximizing the ELBO to learn a good approximate posterior distribution.
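
A toy example makes the ELBO tangible. Everything in this sketch is hypothetical: a single binary latent variable \(z\), one fixed observation \(x\), and made-up probabilities. The approximate posterior is \(q_\lambda(z) = \textrm{Bernoulli}(\lambda)\). The ELBO never exceeds \(\log p(x)\) and touches it when \(q_\lambda\) equals the true posterior.

```python
import math

# Hypothetical joint p(x, z) for one fixed observation x:
# prior p(z=1) = 0.4, likelihoods p(x | z=0) = 0.2 and p(x | z=1) = 0.7.
joint = {0: 0.6 * 0.2, 1: 0.4 * 0.7}

evidence = sum(joint.values())        # p(x), tractable only because z is tiny
log_evidence = math.log(evidence)
posterior_z1 = joint[1] / evidence    # exact p(z=1 | x)


def elbo(lam):
    """E_q[log p(x, z)] - E_q[log q(z)] for q(z) = Bernoulli(lam)."""
    q = {0: 1.0 - lam, 1: lam}
    return sum(q[z] * (math.log(joint[z]) - math.log(q[z]))
               for z in q if q[z] > 0.0)


for lam in (0.1, 0.5, posterior_z1, 0.9):
    print(f"lam={lam:.2f}  ELBO={elbo(lam):.4f}  log p(x)={log_evidence:.4f}")
```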

This technique has been used in machine learning for the past two decades. It is becoming popular because intractable partition functions come with the need to analyze large datasets. Because the variational principle relies on optimizing a lower bound, the field has borrowed heavily from the optimization literature to scale Bayesian inference to massive data. It’s an exciting area, as new techniques from stochastic optimization may enable us to explore new physics and machine learning models.

Connections: are machine learning techniques useful in physics?

There are many techniques for approximating partition functions developed in the machine learning community that may find use in physics.

For example, black box variational inference and automatic differentiation variational inference are generic methods that may be useful in physics. They develop frameworks for constructing expressive approximate distributions and efficient optimization techniques.

Question for physicists familiar with variational methods: is stochastic optimization used in variational methods? Would this be useful?

Connections: could tools from physics be useful in machine learning?

Yes! The Gibbs-Bogoliubov-Feynman inequality was originally developed in physics and found its way to machine learning through Michael Jordan’s group at MIT in the 90s.

There seems to be a separate literature on constructing flexible families of distributions to approximate distributions. The replica trick, renormalization group theory, and others are just some topics that are beginning to make their way from statistical physics to machine learning.

Another example of tools from physics used in machine learning is operator variational inference. In this work, we developed a framework for constructing operators (such as the KL divergence) that measure how good an approximation is. The framework enables making explicit the tradeoffs between how good our approximation is and how much computation a variational method requires. The Langevin-Stein operator is equivalent to the Hamiltonian operator in physics (note) and was originally developed in a Physical Review Letters paper.

A fun question to ponder is “why KL divergence?” and the physics perspective is illuminating. It corresponds to the first-order Taylor expansion of the partition function and comes with assumptions about the non-equilibrium perturbed distribution. Does the second-order Taylor expansion correspond to another divergence and yield more accurate solutions?

I recently learned about replica theory. The replica trick is a technique for calculating the partition function of a system exactly, using an insane formula. It raises the question: what assumptions do we need to use this for probabilistic graphical models?

I’m excited to see more work in this area as physicists migrate to data science and machine learning.


How can we make transitions faster? How can we efficiently move techniques between machine learning and physics? Would code samples be helpful?

This post is an attempt at mapping the language from one community to another. Another idea is a long review paper that gives detailed examples of models solved within a statistical physics framework (with mean-field methods, replica theory, renormalization theory, etc) and solved with modern variational inference from a machine learning perspective (black box variational inference, stochastic optimization, etc). This would highlight how the fields complement each other.

Glossary
  • Expectations: the angle brackets \(\langle ~~\cdot~~\rangle\) denote an expectation. In the machine learning literature, this is denoted as \(\mathbb{E}_p[~~\cdot~~]\) for the expectation of a quantity with respect to the distribution \(p\). For example, \(\langle f(\vec{s}) \rangle\) denotes an expectation of a function of the spins \(f(\vec{s})\). The expectation is implicitly with respect to the Boltzmann distribution: \(\langle f(\vec{s}) \rangle = \mathbb{E}_p[f(\vec{s})] = \sum_{\{s_1, ..., s_N\}} f(\vec{s}) p(\vec{s})\) \(= \sum_{\{s_1, ..., s_N\}} f(\vec{s})\frac{e^{-\beta E(\vec{s})}}{Z}\)
  • Spins in physics are called random variables in statistics and machine learning.
  • The evidence lower bound in variational inference is the negative free energy in physics terminology.

Anything to add or fix in this article to reduce confusion and increase clarity? Please email me, tweet, or submit a pull request.


References
  • Peterson & Anderson (1987) used solutions to time-dependent Ising models to learn the parameters of Boltzmann machines. This is a canonical reference for the ‘start’ of variational inference as it is known in the machine learning community.
  • You can go deep into Ising models: there are hundreds of lectures and references online. Here are the sources I used for these notes: from Basel and Munich.
  • Dave’s course, Foundations of Graphical Models
  • Wainwright & Jordan (2008) is challenging but worthwhile.
  • David MacKay’s Information Theory, Inference, and Learning Algorithms has a section on variational free energy (Chapter 33, p. 422).
  • David Chandler’s Introduction to Modern Statistical Mechanics (1987) has a simple derivation of the variational free energy (Section 5.1, pp. 135-138) that I followed in this exposition.
  • Feynman, Statistical Mechanics - A set of lecture notes (1972) derives the variational free energy using a perturbation expansion (Section 2.11, pp. 67-71).
  • Parisi’s Statistical Field Theory (1988) derives the variational principle in three different ways (Section 3.2, pp. 24-31).
  • Matthew Beal’s thesis has interesting references, and Rich Turner has notes on correspondences between physics and machine learning.

Thanks to Bohdan Kulchytskyy, Florian Wentzel, Siddharth Mishra-Sharma, Smiti Kaul, Guillaume Verdon, Henri Palacci, Sam Ritter, Mattias Fitzpatrick, and Sophie Kleber for comments and encouragement. Image credits: Freepik for iconography, and Analytical Scientific for the Newton’s cradle image.

Addendum

This blog post ended up seeding the first several chapters of my thesis.

Footnotes
  1. Derivation 

  2. For a tiny system, e.g. with three spins, we have \(8\) states and the sum is doable - but the system is uninteresting. 

  3. For example, the magnetization of dysprosium aluminium garnet at low temperatures is exactly described by this model. 

  4. To see this, recall that \(\cosh x = \frac{e^x + e^{-x}}{2}\) 

  5. Visual proof that \(e^x \geq x + 1\). 

  6. The semicolon notation means “the distribution over \(x\) is parameterized in terms of the parameter \(\pi\)”. 

  7. Writing the parameters of a distribution as a subscript (\(p_\theta(s)\)) is shorthand for writing them after the semicolon (\(p(s; \theta)\)). 

  8. In the variational treatment of the Ising model we had one variational parameter, the perturbation to the static magnetic field \(\lambda = \Delta H\). 

How does physics connect to machine learning? was originally published by Jaan Lı at Jaan Lı on August 11, 2017.

https://jaan.io/how-does-physics-connect-machine-learning
food2vec - Augmented cooking with machine intelligence

TL;DR: Check out the tools demo to explore food analogies and recommendations, or scroll down for an interactive map of a hundred thousand recipes from around the world.

I haven’t eaten in five days. I dream of food. I study food. Deep in ketosis, my body has adapted to consume itself: I am food. There is no better time to dig into modeling grub.

Machine intelligence has changed your life, from how you listen to music through Discover Weekly playlists, consume news through Facebook, or talk to your hand computer’s friendly digital assistant. But why hasn’t it changed how we eat? Can we modify the ingredients of language processing algorithms to get insights about food? If you tell me what you want to eat, can I recommend complementary foods, much like Spotify recommends complementary songs?

Word embeddings are a useful technique for analyzing discrete data. Say we use \(170,000\) words from the Oxford English dictionary. We can represent each word (such as “food”) as a vector as follows: a list of \(169,999\) zeros, with a single \(1\) at the location of the word in the vocabulary. In our case, “food” may be at location \(29,163\) near other words beginning with the letter f. Then the vector for “food” would look like:

\[[0, 0, 0, ..., 0, 0, 1, 0, 0, ..., 0].\]

However, this is inadequate for comparing words. To compare documents and get useful insights from our data, we need to aggregate over \(170,000\) dimensions for each word, which takes far too long. Can we do better?

Embeddings let us reduce the dimensionality of the problem, and give us a powerful representation of language. We can build a model of language where we assign a hundred random numbers to each word. To train the model, we use these hundred numbers of each word to predict their context. The “context” of a word consists of its surrounding words. This is the main idea: the context means that words that occur in similar contexts should have similar meanings. We tweak the numbers assigned to a word to make them better at predicting words in the context. Initially, the random numbers assigned to a word will be bad at predicting words in the context. But gradually, through this process of tweaking the model’s predictions of surrounding words, we get a hundred numbers that are far from random. The hundred numbers representing each word will capture part of its meaning: similar words will cluster together because they occur in each other’s contexts, and words with different meanings are pushed far apart (out-of-context). By representing each word as an embedding in \(100\) dimensions, we have reduced the dimensionality more than a thousandfold from \(170,000\) and gained a better representation of language.

For modeling food, we have a collection of recipes. We can define the context of an ingredient in a recipe to be the rest of the foods in the recipe. This demonstrates the flexibility of embeddings: by making a small change in the definition of the context, we can now apply it to a totally different kind of data.
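
As a minimal sketch of this definition of context (the recipes and the function name below are invented for illustration, not taken from the dataset), each ingredient is paired with the remaining ingredients of the same recipe:

```python
def context_pairs(recipes):
    """Pair every ingredient with the other ingredients in its recipe."""
    pairs = []
    for recipe in recipes:
        for i, target in enumerate(recipe):
            context = recipe[:i] + recipe[i + 1:]  # everything except the target
            pairs.append((target, context))
    return pairs


# Hypothetical toy recipes.
recipes = [
    ["tomato", "basil", "olive oil"],
    ["rice", "soy sauce"],
]

for target, context in context_pairs(recipes):
    print(target, "->", context)
```

An embedding model would then be trained to predict each context from its target ingredient. For text, the only change is that the context is a window of surrounding words instead of the whole recipe.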

Food similarity map

After training the embedding algorithm on a collection of \(95,896\) recipes, we get \(100\)-dimensional embeddings for each food. Humans can’t visualize high dimensions, so we use an approximation technique to visualize similarity between the foods in two dimensions.

Here is a similarity map of the \(2,087\) ingredients used in the recipes. Hover over a point to see which food it represents:

The map of foods is reasonable. Ingredients from Asia cluster together, as do ingredients used in European and North American cooking.

Recipe embedding map

We can generate an embedding for a recipe by taking the average of its ingredients’ embeddings. Here is a map of \(95,896\) recipes from around the world. Hover over a point to see the recipe, and click on the cuisine legend on the right to show or hide certain regions:

IMPORTANT: you are about to download 15MB of data. Click here to access the map, zoom in, and discover new flavors. Is this the fastest way to browse 100k recipes by similarity?

Interesting patterns emerge. Asian recipes cluster together, as do Southern European recipes. Northern European and American foods are all over the place, maybe because of transmission of recipes due to migration, or over-representation in the data.

Food similarity tool

Access the tool at this link. We can calculate food similarity by looking at which food is closest in the high-dimensional embedding space.

These mostly make sense - foods are closest to other foods they appear with in recipes:

  • Cheese is closest to macaroni
  • Sesame oil is closest to egg noodle
  • Milk is closest to nutmeg
  • Olive oil is closest to parmesan cheese
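
A nearest-neighbor lookup like the one behind this list can be sketched as follows. The three-dimensional vectors are made-up stand-ins for the trained \(100\)-dimensional embeddings, and cosine similarity is an assumption here, since the distance measure used by the tool is not stated above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(name, embeddings):
    """Return the food whose embedding is most similar to that of `name`."""
    candidates = [other for other in embeddings if other != name]
    return max(candidates, key=lambda other: cosine(embeddings[name], embeddings[other]))


# Hypothetical toy embeddings, hand-picked for illustration.
toy_embeddings = {
    "cheese": [0.9, 0.1, 0.0],
    "macaroni": [0.8, 0.2, 0.1],
    "sesame oil": [0.0, 0.9, 0.3],
    "egg noodle": [0.1, 0.8, 0.4],
}

for food in ("cheese", "sesame oil"):
    print(food, "is closest to", nearest(food, toy_embeddings))
```
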
Food analogy tool

Access the tool here. Food analogies, like word analogies, are calculated with vector arithmetic. For the analogy “Food A is to food B, as food C is to food D”, the goal is to predict a reasonable food D. We can do this by subtracting food A from food B, then adding food C. For example, for “egg is to bacon as orange juice is to ...”, calculating \((\text{bacon} - \text{egg}) + \text{orange juice}\) in embedding space will yield an embedding. The closest embedding to this is \(\text{coffee}\) in our model of food. The classic example from word embeddings is \((\text{king} - \text{man}) + \text{woman} = \text{queen}\). Is this intuitive? “Man is to king as woman is to queen” makes sense in natural language, but food analogies are less clear. With practice, we may be able to train our taste detectors and devise hypotheses to test in the realm of food. I also included cuisine embeddings by representing them as the average of their recipes’ embeddings.

Some of these are more plausible than others:

  • Egg is to bacon as orange juice is to coffee.
  • Bread is to butter as roast beef is to sage.
  • Smoked salmon is to dill as lamb is to asparagus.
  • South Asian is to rice as Southern European is to thyme.
  • Rice is to sesame seed as macaroni is to pimento.
  • Roasted beef is to green bell pepper as pork sausage is to fenugreek.
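
The vector arithmetic behind these analogies can be sketched in a few lines. The two-dimensional embeddings below are invented so that the first analogy works out. They are not the trained vectors, and using cosine similarity to find the closest food is an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(a, b, c, embeddings):
    """Answer 'a is to b as c is to ?' with the food closest to b - a + c."""
    query = [vb - va + vc
             for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [name for name in embeddings if name not in (a, b, c)]
    return max(candidates, key=lambda name: cosine(query, embeddings[name]))


# Hypothetical toy embeddings, hand-picked for illustration.
toy_embeddings = {
    "egg": [1.0, 0.0],
    "bacon": [1.0, 1.0],
    "orange juice": [3.0, 0.0],
    "coffee": [3.0, 1.0],
    "milk": [2.0, -1.0],
}

print("egg is to bacon as orange juice is to",
      analogy("egg", "bacon", "orange juice", toy_embeddings))
```

The inputs are excluded from the candidates because the query vector usually stays close to food C itself.
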
Recipe recommendation tool

Access the tool here. We can use our model of food as a recommendation system for cooks. By taking the average embedding for a set of foods, we can look up foods with the closest embeddings.

For example, I am a lifelong aficionado of peanut butter jam sandwiches. I entered my usual favorite: white bread, butter, peanut butter, honey. The top recommendation was: strawberry. I’ve never tried that, and it’s pretty good! I happily broke my fast with it. For the recipe of lamb, cumin, tomato, the top recommendation is raisin - also reasonable and interesting. Other recommendations are a bit wackier, so best of luck.
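
Here is a minimal sketch of such a recommender, under the same caveats as before: the two-dimensional embeddings are invented for illustration, and cosine similarity is an assumed choice. It averages the embeddings of the foods entered and ranks every other food by similarity to that average.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def recommend(ingredients, embeddings, top_k=1):
    """Rank foods not already in the recipe by similarity to the mean embedding."""
    vectors = [embeddings[name] for name in ingredients]
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    candidates = [name for name in embeddings if name not in ingredients]
    ranked = sorted(candidates, key=lambda name: cosine(mean, embeddings[name]),
                    reverse=True)
    return ranked[:top_k]


# Hypothetical toy embeddings, hand-picked for illustration.
toy_embeddings = {
    "white bread": [1.0, 0.2],
    "butter": [0.9, 0.3],
    "peanut butter": [0.8, 0.5],
    "honey": [0.6, 0.8],
    "strawberry": [0.7, 0.6],
    "cumin": [-0.5, 0.9],
}

sandwich = ["white bread", "butter", "peanut butter", "honey"]
print(recommend(sandwich, toy_embeddings))
```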

If you end up adding an ingredient to your food based on these tools, I’d love to hear how it tasted: ping me on Twitter or email!

What’s next?
  • Figuring out the right user interface to explore these models. The code for the plots and recommendation tools is on github. It would be great to make these mobile-friendly and test other ways of presenting recommendations from the model to users.
  • word2vec is not the best model for this. Multi-class regression should work well, and I added a working demo of this to the repo. This is a rare case where the vocabulary size (number of ingredients) is very small, so we can fit both models and compare them. This could reveal idiosyncrasies in the noise-contrastive estimation loss used in word2vec and provides an interesting testbed.
  • Scaling up the data: Do you have a larger dataset of recipes, or do you know how to scrape one? I’d love to check it out. This would also fix bias in the data as the majority of the recipes are currently North American.
  • Testing out recipe analogies combined with food analogies: this may be more intuitive for us humans. For example, “pancakes are to maple syrup, as an omelette is to cheese” could be easier to think about than analogies with individual ingredients.
Resources
  • This NYT piece, The Great AI Awakening, does a much better job at describing embeddings than I can
  • Wesley has a neat paper on a similar approach: diet2vec
  • Sanjeev Arora’s research has good explanations for the analogy properties of embeddings
  • The t-SNE algorithm for visualizing high-dimensional embeddings
  • The original Nature Scientific Report with the data
  • Dave taught a fantastic class that helped me understand embeddings
  • Maja’s paper on exponential family embeddings generalizes word2vec to other distributions that would be neat to try on this data (word2vec can be interpreted as a Bernoulli embedding model with biased gradients)

Thanks to David Blei for the idea, Peter Bearman for presenting his work to our group, MealMakeOverMoms for the mise photo, Anthony for open-sourcing the embedding browser on which ours is based, and Plotly for open-sourcing their fantastic plotting library.

Feel free to ping me on Twitter or email with feedback or ideas!

Discussion on Hacker News and Reddit. Also see slides from a talk at the New York Times on this project.

food2vec - Augmented cooking with machine intelligence was originally published by Jaan Lı at Jaan Lı on January 22, 2017.

https://jaan.io/food2vec-augmented-cooking-machine-intelligence
Variational Autoencoder Perspectives.md
### Takeaway: why the neural net perspective limits us

I hope you are convinced that reasoning about the variational autoencoder is less ambiguous and less confusing from the perspective of variational inference in probability models.

In neural net language, the variational autoencoder refers to an encoder, a decoder, and a loss function. In probability model terms, the variational autoencoder refers to approximate inference in a latent Gaussian model, where the approximate posterior and model likelihood are parametrized by neural nets (the inference and generative networks).

The sentence describing the variational autoencoder in neural net terms is unclear: What is the encoder? What does the decoder mean? What is the loss function? Each term requires further explanation. In contrast, the probability model language gives us an objective function (the ELBO) for free, and we can simply state that we parametrize the approximate posterior and model with neural nets.

Here are more reasons why we should favor the probability model perspective on variational autoencoders:

* *Separating model and inference*: Shakir [makes this point well](http://blog.shakirm.com/2015/03/a-statistical-view-of-deep-learning-ii-auto-encoders-and-free-energy/). Rather than being limited to an 'encoder' in neural net terms, we can think of the probability model at hand, $$ p(x, z) $$ separately from the approximate inference scheme. This lets us choose from a variety of methods, rather than thinking only in terms of amortized inference using a neural net. It is our choice whether to explore other (perhaps better) methods such as mean-field variational inference or MCMC/HMC/Langevin dynamics to learn the parameters of the model.
* *Composability*: the moment we add a second layer of latent variables to our model that depend on the first layer, the encoder/decoder framework breaks down. How should we parametrize the inference network? Can we still do amortized inference? The framework of probability models can help us build more complex models from basic building blocks, and gives us clear frameworks for how to do inference. Thinking in terms of encoders is dangerous for top-down inference, as it is unclear how to parametrize the encoder for any more than one layer of latent variables.
* *Regularization is free*: in neural net terms, we discussed a 'regularizer' term in the loss function (the KL divergence between the approximate posterior and prior). This comes out of the blue if one is not familiar with variational inference. But in probability model language, it is simply an alternate form of the ELBO, and we can immediately think about alternative priors that may be more appropriate for the data we wish to model.

Variational Autoencoder Perspectives.md was originally published by Jaan Lı at Jaan Lı on July 24, 2016.

https://jaan.io/variational-autoencoder-perspectives.md
Tutorial - What is a variational autoencoder?

Why do deep learning researchers and probabilistic machine learning folks get confused when discussing variational autoencoders? What is a variational autoencoder? Why is there unreasonable confusion surrounding this term?

There is a conceptual and language gap. The sciences of neural networks and probability models do not have a shared language. My goal is to bridge this idea gap and allow for more collaboration and discussion between these fields, and provide a consistent implementation (GitHub link). If many words here are new to you, jump to the glossary.

Variational autoencoders are cool. They let us design complex generative models of data, and fit them to large datasets. They can generate images of fictional celebrity faces and high-resolution digital artwork.

Variational autoencoder applied to faces.
Fictional celebrity faces generated by a variational autoencoder (by Alec Radford).

These models also yield state-of-the-art machine learning results in image generation and reinforcement learning. Variational autoencoders (VAEs) were defined in 2013 by Kingma et al. and Rezende et al.

How can we create a language for discussing variational autoencoders? Let’s think about them first using neural networks, then using variational inference in probability models.

The neural net perspective

In neural net language, a variational autoencoder consists of an encoder, a decoder, and a loss function.

The encoder compresses data into a latent space (z). The decoder reconstructs the data given the hidden representation.

The encoder is a neural network. Its input is a datapoint \(x\), its output is a hidden representation \(z\), and it has weights and biases \(\theta\). To be concrete, let’s say \(x\) is a 28 by 28-pixel photo of a handwritten number. The encoder ‘encodes’ the \(784\)-dimensional data into a latent (hidden) representation space \(z\) that has far fewer than \(784\) dimensions. This is typically referred to as a ‘bottleneck’ because the encoder must learn an efficient compression of the data into this lower-dimensional space. Let’s denote the encoder \(q_\theta (z \mid x)\). We note that the lower-dimensional space is stochastic: the encoder outputs the parameters of \(q_\theta (z \mid x)\), which is a Gaussian probability density. We can sample from this distribution to get noisy values of the representations \(z\).

The decoder is another neural net. Its input is the representation \(z\), it outputs the parameters to the probability distribution of the data, and has weights and biases \(\phi\). The decoder is denoted by \(p_\phi(x\mid z)\). Running with the handwritten digit example, let’s say the photos are black and white and represent each pixel as \(0\) or \(1\). The probability distribution of a single pixel can be then represented using a Bernoulli distribution. The decoder gets as input the latent representation of a digit \(z\) and outputs \(784\) Bernoulli parameters, one for each of the \(784\) pixels in the image. The decoder ‘decodes’ the real-valued numbers in \(z\) into \(784\) real-valued numbers between \(0\) and \(1\). Information from the original \(784\)-dimensional vector cannot be perfectly transmitted, because the decoder only has access to a summary of the information (in the form of a less-than-\(784\)-dimensional vector \(z\)). How much information is lost? We measure this using the reconstruction log-likelihood \(\log p_\phi (x\mid z)\) whose units are nats. This measure tells us how effectively the decoder has learned to reconstruct an input image \(x\) given its latent representation \(z\).
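
To make the reconstruction log-likelihood concrete, here is a minimal sketch of the Bernoulli log-likelihood summed over pixels, in nats. The function name, the 4-pixel toy image, and the two sets of decoder outputs are all made up for illustration; a real decoder would output \(784\) values.

```python
import math

def bernoulli_log_likelihood(pixels, probs, eps=1e-7):
    """Sum over pixels of log p(x_d | z), in nats.

    pixels: binary values (0 or 1), one per pixel.
    probs:  decoder outputs in (0, 1), the Bernoulli parameter for each pixel.
    """
    total = 0.0
    for x, p in zip(pixels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += x * math.log(p) + (1 - x) * math.log(1.0 - p)
    return total

# Toy 4-pixel "image" (an assumption; the post uses 784 pixels).
image = [1, 0, 0, 1]
good_decoder_output = [0.9, 0.1, 0.1, 0.9]  # high probability on the true pixels
bad_decoder_output = [0.1, 0.9, 0.9, 0.1]   # high probability on the wrong pixels

good_ll = bernoulli_log_likelihood(image, good_decoder_output)
bad_ll = bernoulli_log_likelihood(image, bad_decoder_output)
print(good_ll, bad_ll)
```

The closer the log-likelihood is to zero, the less information was lost; the second decoder output puts its probability mass on the wrong pixels and is penalized heavily.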

The loss function of the variational autoencoder is the negative log-likelihood with a regularizer. Because there are no global representations that are shared by all datapoints, we can decompose the loss function into terms \(l_i\) that each depend on only a single datapoint. The total loss is then \(\sum_{i=1}^N l_i\) for \(N\) total datapoints. The loss function \(l_i\) for datapoint \(x_i\) is:

\[l_i(\theta, \phi) = - \mathbb{E}_{z\sim q_\theta(z\mid x_i)}[\log p_\phi(x_i\mid z)] + \mathbb{KL}(q_\theta(z\mid x_i) \mid\mid p(z))\]

The first term is the reconstruction loss, or expected negative log-likelihood of the \(i\)-th datapoint. The expectation is taken with respect to the encoder’s distribution over the representations. This term encourages the decoder to learn to reconstruct the data. If the decoder’s output does not reconstruct the data well, statistically we say that the decoder parameterizes a likelihood distribution that does not place much probability mass on the true data. For example, if our goal is to model black and white images and our model places high probability on there being black spots where there are actually white spots, this will yield the worst possible reconstruction. Poor reconstruction will incur a large cost in this loss function.

The second term is a regularizer that we throw in (we’ll see how it’s derived later). This is the Kullback-Leibler divergence between the encoder’s distribution \(q_\theta(z\mid x)\) and \(p(z)\). This divergence measures how much information is lost (in units of nats) when using \(q\) to represent \(p\). It is one measure of how close \(q\) is to \(p\).

In the variational autoencoder, \(p\) is specified as a standard Normal distribution with mean zero and variance one, or \(p(z) = Normal(0,1)\). If the encoder outputs representations \(z\) that are different than those from a standard normal distribution, it will receive a penalty in the loss. This regularizer term means ‘keep the representations \(z\) of each digit sufficiently diverse’. If we didn’t include the regularizer, the encoder could learn to cheat and give each datapoint a representation in a different region of Euclidean space. This is bad, because then two images of the same number (say a 2 written by different people, \(2_{alice}\) and \(2_{bob}\)) could end up with very different representations \(z_{alice}, z_{bob}\). We want the representation space of \(z\) to be meaningful, so we penalize this behavior. This has the effect of keeping similar numbers’ representations close together (e.g. so the representations of the digit two \(\{z_{alice}, z_{bob}, z_{ali}\}\) remain sufficiently close).
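
When the encoder’s distribution is a Gaussian with diagonal covariance (an assumption here; the post only says it is Gaussian), the regularizer has a well-known closed form. The sketch below computes it; the function name and the example values are hypothetical.

```python
import math

def kl_to_standard_normal(mu, sigma):
    """KL( Normal(mu, diag(sigma^2)) || Normal(0, I) ) in nats.

    mu, sigma: per-dimension mean and standard deviation of q(z | x).
    """
    return sum(
        0.5 * (m * m + s * s - 1.0 - math.log(s * s))
        for m, s in zip(mu, sigma)
    )

# No penalty when the encoder's distribution equals the prior.
kl_at_prior = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])

# A large penalty when the encoder places a datapoint far away with tiny variance.
kl_far_away = kl_to_standard_normal([5.0, -5.0], [0.1, 0.1])
print(kl_at_prior, kl_far_away)
```

The penalty grows as the encoder moves a representation away from the origin or shrinks its variance, which is exactly the ‘cheating’ behavior the regularizer discourages.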

We train the variational autoencoder using gradient descent to optimize the loss with respect to the parameters of the encoder and decoder \(\theta\) and \(\phi\). For stochastic gradient descent with step size \(\rho\), the encoder parameters are updated using \(\theta \leftarrow \theta - \rho \frac{\partial l}{\partial \theta}\) and the decoder is updated similarly.

The probability model perspective

Now let’s think about variational autoencoders from a probability model perspective. Please forget everything you know about deep learning and neural networks for now. Thinking about the following concepts in isolation from neural networks will clarify things. At the very end, we’ll bring back neural nets.

In the probability model framework, a variational autoencoder contains a specific probability model of data \(x\) and latent variables \(z\). We can write the joint probability of the model as \(p(x, z) = p(x \mid z) p(z)\). The generative process can be written as follows.

For each datapoint \(i\):

  • Draw latent variables \(z_i \sim p(z)\)
  • Draw datapoint \(x_i \sim p(x \mid z_i)\)
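
The two steps of the generative process can be sketched as ancestral sampling. The decoder below is a hypothetical stand-in with made-up weights (a real one would be a trained neural net), and the latent variable is a scalar to keep things short.

```python
import math
import random

def toy_decoder(z):
    """Hypothetical stand-in for p(x | z): scalar z to 4 Bernoulli parameters."""
    weights = [2.0, -2.0, 1.0, -1.0]  # made-up values, not learned
    return [1.0 / (1.0 + math.exp(-w * z)) for w in weights]

def sample_datapoint(rng):
    z = rng.gauss(0.0, 1.0)                            # z_i ~ p(z) = Normal(0, 1)
    probs = toy_decoder(z)                             # parameters of p(x | z_i)
    x = [1 if rng.random() < p else 0 for p in probs]  # x_i ~ Bernoulli(probs)
    return z, x

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_datapoint(rng) for _ in range(5)]
for z, x in samples:
    print(round(z, 3), x)
```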

We can represent this as a graphical model:

The graphical model representation of the model in the variational autoencoder. The latent variable z is a standard normal, and the data are drawn from p(x|z). The shaded node for X denotes observed data. For black and white images of handwritten digits, this data likelihood is Bernoulli distributed.

This is the central object we think about when discussing variational autoencoders from a probability model perspective. The latent variables are drawn from a prior \(p(z)\). The data \(x\) have a likelihood \(p(x \mid z)\) that is conditioned on latent variables \(z\). The model defines a joint probability distribution over data and latent variables: \(p(x, z)\). We can decompose this into the likelihood and prior: \(p(x,z) = p(x\mid z)p(z)\). For black and white digits, the likelihood is Bernoulli distributed.

Now we can think about inference in this model. The goal is to infer good values of the latent variables given observed data, or to calculate the posterior \(p(z \mid x)\). Bayes says:

\[p(z \mid x) = \frac{p(x \mid z)p(z)}{p(x)}.\]

Examine the denominator \(p(x)\). This is called the evidence, and we can calculate it by marginalizing out the latent variables: \(p(x) = \int p(x \mid z) p(z) dz\). Unfortunately, this integral requires exponential time to compute as it needs to be evaluated over all configurations of latent variables. We therefore need to approximate this posterior distribution.

Variational inference approximates the posterior with a family of distributions \(q_\lambda(z \mid x)\). The variational parameter \(\lambda\) indexes the family of distributions. For example, if \(q\) were Gaussian, it would be the mean and variance of the latent variables for each datapoint \(\lambda_{x_i} = (\mu_{x_i}, \sigma^2_{x_i})\).

How can we know how well our variational posterior \(q(z \mid x)\) approximates the true posterior \(p(z \mid x)\)? We can use the Kullback-Leibler divergence, which measures the information lost when using \(q\) to approximate \(p\) (in units of nats):

\[\mathbb{KL}(q_\lambda(z \mid x) \mid \mid p(z \mid x)) = \mathbf{E}_q[\log q_\lambda(z \mid x)]- \mathbf{E}_q[\log p(x, z)] + \log p(x)\]

Our goal is to find the variational parameters \(\lambda\) that minimize this divergence. The optimal approximate posterior is thus

\[q_\lambda^* (z \mid x) = {\arg\min}_\lambda \mathbb{KL}(q_\lambda(z \mid x) \mid \mid p(z \mid x)).\]

Why is this impossible to compute directly? The pesky evidence \(p(x)\) appears in the divergence. This is intractable as discussed above. We need one more ingredient for tractable variational inference. Consider the following function:

\[ELBO(\lambda) = \mathbf{E}_q[\log p(x, z)] - \mathbf{E}_q[\log q_\lambda(z \mid x)].\]

Notice that we can combine this with the Kullback-Leibler divergence and rewrite the evidence as

\[\log p(x) = ELBO(\lambda) + \mathbb{KL}(q_\lambda(z \mid x) \mid \mid p(z \mid x))\]

By Jensen’s inequality, the Kullback-Leibler divergence is always greater than or equal to zero, so the ELBO is a lower bound on the log evidence \(\log p(x)\). Because \(\log p(x)\) does not depend on \(\lambda\), minimizing the Kullback-Leibler divergence is equivalent to maximizing the ELBO. The abbreviation is revealed: the Evidence Lower BOund allows us to do approximate posterior inference. We are saved from having to compute and minimize the Kullback-Leibler divergence between the approximate and exact posteriors. Instead, we can maximize the ELBO, which is equivalent (but computationally tractable).
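
The identity \(\log p(x) = ELBO(\lambda) + \mathbb{KL}\) can be checked numerically in a toy model where every term has a closed form. Note the swap: this sketch uses a Gaussian likelihood, \(z \sim Normal(0,1)\) and \(x \mid z \sim Normal(z, 1)\), rather than the Bernoulli likelihood of the post, purely so that the evidence and the exact posterior are available. All function names are hypothetical.

```python
import math

LOG_2PI = math.log(2.0 * math.pi)

def elbo(x, m, s):
    """ELBO for q(z) = Normal(m, s^2) in the toy model z ~ N(0,1), x|z ~ N(z,1)."""
    expected_log_lik = -0.5 * LOG_2PI - 0.5 * ((x - m) ** 2 + s ** 2)
    expected_log_prior = -0.5 * LOG_2PI - 0.5 * (m ** 2 + s ** 2)
    expected_log_q = -0.5 * LOG_2PI - math.log(s) - 0.5
    return expected_log_lik + expected_log_prior - expected_log_q

def kl_to_posterior(x, m, s):
    """KL(q || p(z | x)); the exact posterior in this toy model is Normal(x/2, 1/2)."""
    post_mean, post_sd = x / 2.0, math.sqrt(0.5)
    return (math.log(post_sd / s)
            + (s ** 2 + (m - post_mean) ** 2) / (2.0 * post_sd ** 2)
            - 0.5)

def log_evidence(x):
    """log p(x); marginally x ~ Normal(0, 2) in this toy model."""
    return -0.5 * math.log(2.0 * math.pi * 2.0) - x ** 2 / 4.0

x = 1.3
for m, s in [(0.0, 1.0), (0.65, math.sqrt(0.5)), (2.0, 0.3)]:
    gap = log_evidence(x) - (elbo(x, m, s) + kl_to_posterior(x, m, s))
    print(f"m={m:.2f} s={s:.2f} ELBO={elbo(x, m, s):.4f} "
          f"KL={kl_to_posterior(x, m, s):.4f} gap={gap:.1e}")
```

For every choice of \((m, s)\) the gap is zero up to floating-point error, and the ELBO is largest where the divergence is smallest (the second setting, which matches the exact posterior).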

In the variational autoencoder model, there are only local latent variables (no datapoint shares its latent \(z\) with the latent variable of another datapoint). So we can decompose the ELBO into a sum where each term depends on a single datapoint. This allows us to use stochastic gradient descent with respect to the parameters \(\lambda\) (important: the variational parameters are shared across datapoints - more on this here). The ELBO for a single datapoint in the variational autoencoder is:

\[ELBO_i(\lambda) = \mathbb{E}_{q_\lambda(z\mid x_i)}[\log p(x_i\mid z)] - \mathbb{KL}(q_\lambda(z\mid x_i) \mid\mid p(z)).\]

To see that this is equivalent to our previous definition of the ELBO, expand the log joint into the prior and likelihood terms and use the product rule for the logarithm.
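
Spelled out for a single datapoint \(x_i\), with all expectations taken under \(q_\lambda(z \mid x_i)\), the steps are as follows. Expanding the log joint gives

\[\mathbb{E}_q[\log p(x_i, z)] - \mathbb{E}_q[\log q_\lambda(z \mid x_i)] = \mathbb{E}_q[\log p(x_i \mid z)] + \mathbb{E}_q[\log p(z)] - \mathbb{E}_q[\log q_\lambda(z \mid x_i)],\]

and grouping the last two terms into a single log ratio gives

\[\mathbb{E}_q[\log p(x_i \mid z)] - \mathbb{E}_q\left[\log \frac{q_\lambda(z \mid x_i)}{p(z)}\right] = \mathbb{E}_q[\log p(x_i \mid z)] - \mathbb{KL}(q_\lambda(z \mid x_i) \mid\mid p(z)).\]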

Let’s make the connection to neural net language. The final step is to parametrize the approximate posterior \(q_\theta (z \mid x, \lambda)\) with an inference network (or encoder) that takes as input data \(x\) and outputs parameters \(\lambda\). We parametrize the likelihood \(p(x \mid z)\) with a generative network (or decoder) that takes latent variables and outputs parameters to the data distribution \(p_\phi(x \mid z)\). The inference and generative networks have parameters \(\theta\) and \(\phi\) respectively. The parameters are typically the weights and biases of the neural nets. We optimize these to maximize the ELBO using stochastic gradient descent (there are no global latent variables, so it is kosher to minibatch our data). We can write the ELBO and include the inference and generative network parameters as:

\[ELBO_i(\theta, \phi) = \mathbb{E}_{q_\theta(z\mid x_i)}[\log p_\phi(x_i\mid z)] - \mathbb{KL}(q_\theta(z\mid x_i) \mid\mid p(z)).\]

This evidence lower bound is the negative of the loss function for variational autoencoders we discussed from the neural net perspective; \(ELBO_i(\theta, \phi) = -l_i(\theta, \phi)\). However, we arrived at it from principled reasoning about probability models and approximate posterior inference. We can still interpret the Kullback-Leibler divergence term as a regularizer, and the expected likelihood term as a reconstruction ‘loss’. But the probability model approach makes clear why these terms exist: to minimize the Kullback-Leibler divergence between the approximate posterior \(q_\lambda(z \mid x)\) and model posterior \(p(z \mid x)\).
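
Here is a rough sketch of how the loss for one datapoint could be estimated: a Monte Carlo average for the reconstruction term plus the closed-form Gaussian divergence. The encoder and decoder are hypothetical hand-written functions standing in for neural nets, the latent variable is a scalar, and the image has 4 pixels; none of this is the implementation linked in this post.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def toy_encoder(x):
    """Hypothetical inference network: maps x to (mu, sigma) of q(z | x)."""
    mu = 0.5 * (sum(x) - len(x) / 2.0)  # made-up rule, not learned weights
    sigma = 0.8
    return mu, sigma

def toy_decoder(z, num_pixels):
    """Hypothetical generative network: one Bernoulli parameter per pixel."""
    return [sigmoid(z)] * num_pixels

def loss_for_datapoint(x, rng, num_samples=10):
    mu, sigma = toy_encoder(x)
    # Monte Carlo estimate of the reconstruction term E_q[log p(x | z)].
    recon = 0.0
    for _ in range(num_samples):
        z = mu + sigma * rng.gauss(0.0, 1.0)  # sample z ~ q(z | x)
        probs = toy_decoder(z, len(x))
        recon += sum(
            xi * math.log(p) + (1 - xi) * math.log(1.0 - p)
            for xi, p in zip(x, probs)
        )
    recon /= num_samples
    # Closed-form KL(q(z | x) || Normal(0, 1)) for a scalar Gaussian q.
    kl = 0.5 * (mu ** 2 + sigma ** 2 - 1.0 - math.log(sigma ** 2))
    elbo = recon - kl
    return -elbo, recon, kl  # the loss is the negative ELBO

rng = random.Random(0)
loss, recon, kl = loss_for_datapoint([1, 0, 1, 1], rng)
print(loss, recon, kl)
```

Training would adjust the encoder and decoder parameters to lower this loss, which is the same as raising the ELBO.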

What about the model parameters? We glossed over this, but it is an important point. The term ‘variational inference’ usually refers to maximizing the ELBO with respect to the variational parameters \(\lambda\). We can also maximize the ELBO with respect to the model parameters \(\phi\) (e.g. the weights and biases of the generative neural network parameterizing the likelihood). This technique is called variational EM (expectation maximization), because we are maximizing the expected log-likelihood of the data with respect to the model parameters.

That’s it! We have followed the recipe for variational inference. We’ve defined:

  • a probability model \(p\) of latent variables and data
  • a variational family \(q\) for the latent variables to approximate our posterior

Then we used the variational inference algorithm to learn the variational parameters (gradient ascent on the ELBO to learn \(\lambda\)). We used variational EM for the model parameters (gradient ascent on the ELBO to learn \(\phi\)).

Experiments

Now we are ready to look at samples from the model. We have two choices to measure progress: sampling from the prior or the posterior. To give us a better idea of how to interpret the learned latent space, we can visualize what the posterior distribution of the latent variables \(q_\lambda(z \mid x)\) looks like.

Computationally, this means feeding an input image \(x\) through the inference network to get the parameters of the Normal distribution, then taking a sample of the latent variable \(z\). We can plot this during training to see how the inference network learns to better approximate the posterior distribution, and place the latent variables for the different classes of digits in different parts of the latent space. Note that at the start of training, the distribution of latent variables is close to the prior (a round blob around \(0\)).

Visualizing the learned approximate posterior during training. As training progresses the digit classes become differentiated in the two-dimensional latent space.

We can also visualize the prior predictive distribution. We fix the values of the latent variables to be equally spaced between \(-3\) and \(3\). Then we can take samples from the likelihood parametrized by the generative network. These ‘hallucinated’ images show us what the model associates with each part of the latent space.
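
A minimal sketch of the grid construction, assuming a two-dimensional latent space and a 5 by 5 grid (the grid size is my choice for brevity). The decoder is again a hypothetical stand-in for the trained generative network.

```python
import math

def latent_grid(low=-3.0, high=3.0, steps=5):
    """Return (z1, z2) pairs on an evenly spaced steps-by-steps grid."""
    axis = [low + i * (high - low) / (steps - 1) for i in range(steps)]
    return [(z1, z2) for z2 in axis for z1 in axis]

def toy_decoder(z):
    """Hypothetical generative network: 2-D latent to 4 Bernoulli parameters."""
    z1, z2 = z
    logits = [z1, -z1, z2, -z2]  # made-up, not learned
    return [1.0 / (1.0 + math.exp(-a)) for a in logits]

grid = latent_grid()
# One vector of pixel probabilities per grid point; with a trained decoder,
# each of these would be drawn as one tile of the visualization.
decoded = [toy_decoder(z) for z in grid]
print(grid[0], grid[-1], len(decoded))
```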

Visualizing the prior predictive distribution by looking at samples of the likelihood. The x and y-axes represent equally spaced latent variable values between -3 and 3 (in two dimensions).
Glossary

We need to decide on the language used for discussing variational autoencoders in a clear and concise way. Here is a glossary of terms I’ve found confusing:

  • Variational Autoencoder (VAE): in neural net language, a VAE consists of an encoder, a decoder, and a loss function. In probability model terms, the variational autoencoder refers to approximate inference in a latent Gaussian model where the approximate posterior and model likelihood are parametrized by neural nets (the inference and generative networks).
  • Loss function: in neural net language, we think of loss functions. Training means minimizing these loss functions. But in variational inference, we maximize the ELBO (which is not a loss function). This leads to awkwardness like calling optimizer.minimize(-elbo) as optimizers in neural net frameworks only support minimization.
  • Encoder: in the neural net world, the encoder is a neural network that outputs a representation \(z\) of data \(x\). In probability model terms, the inference network parametrizes the approximate posterior of the latent variables \(z\). The inference network outputs parameters to the distribution \(q(z \mid x)\).
  • Decoder: in deep learning, the decoder is a neural net that learns to reconstruct the data \(x\) given a representation \(z\). In terms of probability models, the likelihood of the data \(x\) given latent variables \(z\) is parametrized by a generative network. The generative network outputs parameters to the likelihood distribution \(p(x \mid z)\).
  • Local latent variables: these are the \(z_i\) for each datapoint \(x_i\). There are no global latent variables. Because there are only local latent variables, we can easily decompose the ELBO into terms \(\mathcal{L}_i\) that depend only on a single datapoint \(x_i\). This enables stochastic gradient descent.
  • Inference: in neural nets, inference usually means prediction of latent representations given new, never-before-seen datapoints. In probability models, inference refers to inferring the values of latent variables given observed data.

One jargon-laden concept deserves its own subsection:

Mean-field versus amortized inference

This issue was very confusing for me, and I can see how it might be even more confusing for someone coming from a deep learning background. In deep learning, we think of inputs and outputs, encoders and decoders, and loss functions. This can lead to fuzzy, imprecise concepts when learning about probabilistic modeling.

Let’s discuss how mean-field inference differs from amortized inference. This is a choice we face when doing approximate inference to estimate a posterior distribution of latent variables. We might have various constraints: do we have lots of data? Do we have big computers or GPUs? Do we have local, per-datapoint latent variables, or global latent variables shared across all datapoints?

Mean-field variational inference refers to a choice of a variational distribution that factorizes across the \(N\) data points, with no shared parameters:

\[q(z) = \prod_{i=1}^{N} q(z_i; \lambda_i)\]

This means there are free parameters for each datapoint \(\lambda_i\) (e.g. \(\lambda_i = (\mu_i, \sigma_i)\) for Gaussian latent variables). How do we do ‘learning’ for a new, unseen datapoint? We need to maximize the ELBO for each new datapoint, with respect to its mean-field parameter(s) \(\lambda_i\).

Amortized inference refers to ‘amortizing’ the cost of inference across datapoints. One way to do this is by sharing (amortizing) the variational parameters \(\lambda\) across datapoints. For example, in the variational autoencoder, the shared parameters are the weights and biases \(\theta\) of the inference network. These global parameters are shared across all datapoints. If we see a new datapoint and want to see what its approximate posterior \(q(z_i)\) looks like, we can run variational inference again (maximizing the ELBO until convergence), or trust that the shared parameters are ‘good enough’. This can be an advantage over mean-field.

Which one is more flexible? Mean-field inference is strictly more expressive, because it has no shared parameters. The per-data parameters \(\lambda_i\) can ensure our approximate posterior is most faithful to the data. Another way to think of this is that we are limiting the capacity or representational power of our variational family by tying parameters across datapoints (e.g. with a neural network that shares weights and biases across data).
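
To illustrate the capacity argument, here is a sketch in a toy Gaussian model (\(z \sim Normal(0,1)\), \(x \mid z \sim Normal(z,1)\), swapped in for the Bernoulli model of the post because its exact posterior \(Normal(x/2, 1/2)\) is known). The mean-field parameters are set to their optimum per datapoint, while the amortized function is deliberately too restrictive. All names and data values are made up.

```python
import math

LOG_2PI = math.log(2.0 * math.pi)

def elbo(x, m, s):
    """ELBO of q(z) = Normal(m, s^2) for the toy model z ~ N(0,1), x|z ~ N(z,1)."""
    return (-0.5 * LOG_2PI - 0.5 * ((x - m) ** 2 + s ** 2)  # E_q[log p(x | z)]
            - 0.5 * LOG_2PI - 0.5 * (m ** 2 + s ** 2)       # E_q[log p(z)]
            + 0.5 * LOG_2PI + math.log(s) + 0.5)            # entropy of q

data = [-2.0, -0.5, 0.3, 1.7]

# Mean-field: one free (mean, sd) pair per datapoint, so 2N numbers in total.
mean_field_params = [(x / 2.0, math.sqrt(0.5)) for x in data]
mean_field_elbo = sum(elbo(x, m, s) for x, (m, s) in zip(data, mean_field_params))

# Amortized: a shared function maps x to (mean, sd). This one ignores x
# (constant mean), so it cannot match every datapoint's posterior.
def restricted_inference_net(x, b=0.0, s=math.sqrt(0.5)):
    return b, s

amortized_elbo = sum(elbo(x, *restricted_inference_net(x)) for x in data)
print(mean_field_elbo, amortized_elbo)
```

The difference between the two totals is the price of tying parameters. In this particular toy model a shared linear map \(x \mapsto x/2\) would close the gap entirely, so how much is lost depends on how flexible the shared function is.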

Sample PyTorch/TensorFlow implementation

Here is the implementation that was used to generate the figures in this post: GitHub link

Footnote: the reparametrization trick

The final thing we need to implement the variational autoencoder is how to take derivatives with respect to the parameters of a stochastic variable. If we are given \(z\) that is drawn from a distribution \(q_\theta (z \mid x)\), and we want to take derivatives of a function of \(z\) with respect to \(\theta\), how do we do that? The \(z\) sample is fixed, but intuitively its derivative should be nonzero.

For some distributions, it is possible to reparametrize samples in a clever way, such that the stochasticity is independent of the parameters. We want our samples to deterministically depend on the parameters of the distribution. For example, for a normally distributed variable with mean \(\mu\) and standard deviation \(\sigma\), we can sample from it like this:

\[z = \mu + \sigma \odot \epsilon,\]

where \(\epsilon \sim Normal(0, 1)\). Going from \(\sim\) denoting a draw from the distribution to the equals sign \(=\) is the crucial step. We have defined a function that depends on the parameters deterministically. We can thus take derivatives of functions involving \(z\), \(f(z)\) with respect to the parameters of its distribution \(\mu\) and \(\sigma\).
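
As a small worked example, take \(f(z) = z^2\) (my choice, not from the post). Since \(\mathbb{E}[z^2] = \mu^2 + \sigma^2\), the true derivatives are \(2\mu\) and \(2\sigma\), and the sketch below recovers them by averaging chain-rule derivatives over samples of \(\epsilon\). The function name and the sample count are assumptions.

```python
import random

def reparam_gradients(mu, sigma, num_samples, rng):
    """Monte Carlo estimate of d/dmu and d/dsigma of E[f(z)], with f(z) = z^2."""
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)  # the randomness does not depend on mu or sigma
        z = mu + sigma * eps       # deterministic function of (mu, sigma, eps)
        df_dz = 2.0 * z            # derivative of f(z) = z^2
        grad_mu += df_dz * 1.0     # chain rule: dz/dmu = 1
        grad_sigma += df_dz * eps  # chain rule: dz/dsigma = eps
    return grad_mu / num_samples, grad_sigma / num_samples

rng = random.Random(0)
g_mu, g_sigma = reparam_gradients(mu=1.5, sigma=0.7, num_samples=100_000, rng=rng)
print(g_mu, g_sigma)  # should be close to 2 * 1.5 and 2 * 0.7
```

In a deep learning framework the chain-rule bookkeeping is done by automatic differentiation; the sketch writes it out by hand to show what is being differentiated.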

The reparametrization trick allows us to push the randomness of a normally-distributed random variable z into epsilon, which is sampled from a standard normal. Diamonds indicate deterministic dependencies, circles indicate random variables.

In the variational autoencoder, the mean and variance are output by an inference network with parameters \(\theta\) that we optimize. The reparametrization trick lets us backpropagate (take derivatives using the chain rule) with respect to \(\theta\) through the objective (the ELBO) which is a function of samples of the latent variables \(z\).

Further reading and improvements
  • Strictly speaking, the Bernoulli likelihood is an incorrect choice for the MNIST dataset. The handwritten digits are ‘close’ to binary-valued, but are in fact continuous. This paper fixes the issue with the continuous Bernoulli distribution.

Is anything in this article confusing or can any explanation be improved? Please submit a pull request, tweet me, or email me :)


References for ideas and figures

Many ideas and figures are from Shakir Mohamed’s excellent blog posts on the reparametrization trick and autoencoders. Durk Kingma created the great visual of the reparametrization trick. Great references for variational inference are this tutorial and David Blei’s course notes. Dustin Tran has a helpful blog post on variational autoencoders. The header’s molecule samples generated from a variational autoencoder are from this paper.

Thanks to Rajesh Ranganath, Andriy Mnih, Ben Poole, Jon Berliner, Cassandra Xia, and Ryan Sepassi for discussions and many concepts in this article. Thanks to Batuhan Koyuncu for regenerating the GIFs!

Discussion on Hacker News and Reddit. Featured in David Duvenaud’s course syllabus on “Differentiable inference and generative models”.

Cite this work: DOI

Tutorial - What is a variational autoencoder? was originally published by Jaan Lı at Jaan Lı on July 18, 2016.

https://jaan.io/what-is-variational-autoencoder-vae-tutorial
Experiments in information overload

Why read clickbait over longform journalism? Why use Facebook so much?

Screentime is bad, but the attention economy incentivizes our addiction.

I’m trying to reduce information overload. Here are some methods I’ve used for a while.

Thoughts on benefits, pitfalls, and other ideas? Hit me up at jaan@onefact.org. I’ll keep this updated.

The current list:

Body & Mind
  • For complete blocking of ambient noise, 3M Peltor industrial earmuffs over in-ear noise-canceling headphones give the most reduction I’ve found in crowded NYC subways (around 40-50 decibels; to the point of being dangerous on the streets).
  • Tara Brach’s guided meditations and podcasts are awesome for downtime and self-therapy during the week. Ditto for the Headspace and Waking Up apps, and biofeedback training using the Elite HRV app.
  • Running to places to save time (I sweat a lot so buy technical clothing that dries fast).
  • Using CitiBike as often as possible (clocked about 1300km in NYC in 2017).
Interpersonal
  • Training yourself to speak faster to save time in service interactions, like at coffee shops.
  • Training in motivational interviewing, active listening, negotiation, and related techniques to improve outcomes and help in crisis situations (good practice: active listening & negotiation techniques when stuck on the phone with big companies).
  • Building rituals for socializing: making a lengthy manual espresso at work, cooking, sauna, social dance (swing, forró).
Products
  • Sign up for LOT2046 to outsource decisions about clothing and apparel.
  • Use Wirecutter as often as possible when buying products.
Media diet
  • Once a year: skim RSS feeds using Feedly, save to Instapaper. I like Longform’s curation service of high quality long-form journalism and essays.
  • No TV, and about five movies per year. If I’m forced to watch movies/TV/other media, watch it at 2x (it’s a bit harder to follow at first on these faster speeds; start at 1.5x and work your way up). Ditto for podcasts and lectures: 2x saves a lot of time with little loss in retention.
  • Every few months: skimming the top 20 posts of each day on Hacker News using HckrNews. And catching up on the recommendations from longform.org (it feels like a slower-paced, more intellectual news source). Interesting links go to Instapaper. Doing this in batches helps filter useful things from hype cycle fare.
  • No news except for the Harper’s Weekly Review delivered to my inbox.
  • Some podcasts (currently: Conversations with Tyler [Cowen], The Tim Ferriss Show, the Longform podcast).
Diet
  • Soylent or Huel as a meal replacement, in those times when I’m about to get some unhealthy fried food because ‘I don’t have time’ to cook.
Phone
  • Use Android’s excellent digital wellbeing features such as app timers and wind down.
  • Many tricks from ‘This is your brain on mobile’ like no phone notifications and no social media apps. I’ve tried keeping the Chrome app disabled; ideally, I would root my phone and remove it permanently.
  • Using a pay-as-you-go phone plan (right now, Google Fi). Paying for data by the megabyte helps prevent overuse.
  • Using the kitchen safe to lock the phone during work hours.
Computer
  • Two devices: a ‘work’ laptop and an ‘email/distraction’ device. This is currently a Macbook Air. The cost-benefit analysis of having two devices is worth it for me: I will happily pay a few hundred dollars for a bad computer on which to use email and gain hours of focused time every week.
  • On the work laptop, permanently block Gmail and any distracting sites (using SelfControl with extended block lengths). Currently: only use email after 4:30pm every day.
  • Looking at Rescuetime logs periodically and updating my SelfControl blacklist with distracting websites.
  • Making all desktop backgrounds pictures of your aged face (through an app). This can add immediacy and help you make better decisions. Or a death countdown timer based on actuarial tables.
  • Using countdown apps to keep the number of days to a big deadline visible every day. This can help connect your present self to your future self.
Devices in general
  • Leaving all devices at work as often as possible. Then I’m totally disconnected and forced to read, be alone, and take a real break.
Social media
  • On Facebook, systematically unfollowing everyone in the newsfeed (this took an hour of clicking, but was well worth it). An empty newsfeed is revelatory: if I’m truly interested in someone, I’ll go to their profile page.
  • Keeping Twitter, Facebook, etc. blocked for weeks at a time helps build up tolerance to inactivity.
Benefits and pitfalls

When stuck I’m forced to go to lame sites (the interesting ones like Gmail and Hacker News are blocked). I can’t New Tab away from boredom and have to sit with the discomfort.

I’m getting better at staying with this existential fear-of-failure crisis of ‘I’m stuck’. Paying attention to the discomfort helps; eventually it dissipates and interesting ideas appear. But this doesn’t happen if I’m constantly checking emails, sites, or phone notifications.

Severely limiting my information intake means I read more and get bored more often. It takes a few days to hear about big news sometimes.

How could information filtering be applied to the physical world? I’d love to see the collections at the Met, MoMA, and other NYC museums, but there is no time. I need a recommendation system for this equivalent to Instapaper or Longform. Museum collections and physical things do not have a ‘save for later’ feature. Adblock for the real world isn’t here yet (cf. augmented reality circa 2020). Meanwhile, can we mentally train to avoid advertising?

If everyone used these weird hacks, no one would make money online and there would be no one to upvote things on Hacker News or Reddit; no one to ‘like’. These techniques might leave me more (or less) susceptible to filter bubbles, but it’s a fun experiment.

Could recommendation systems be the solution to these issues? Would this accentuate the problem? Is there even a problem, or is this part of the ‘busy’ trap?

Do let me know about alternative ideas or things to try at jaan@onefact.org!

Related reading
  • Stuart Whatley on how smartphones cause anxiety and the psychology of boredom.

Experiments in information overload was originally published by Jaan Lı at Jaan Lı on November 27, 2015.

https://jaan.io/info-overload
Useful Science

One-sentence summaries of science to improve your life (incl. podcast). Live at usefulscience.org. I helped write an editorial about our mission, and Maryse recently gave an excellent overview keynote presentation at the Pacific Northwest Communicating Science conference.

Featured on various news websites; ‘received funding’ on CBC Television’s Dragons’ Den (link to episode).

Useful Science was originally published by Jaan Lı at Jaan Lı on November 26, 2015.

https://jaan.io/useful-science
Princeton Pianos

I’ve played more piano since starting grad school than throughout my four years at McGill, thanks to the abundance of pianos on campus. The pianos’ conditions range from excellent to passable, with some sporting Köln-level limitations. However, I couldn’t find a resource listing their locations.

Another ‘open-sourced locations’ project could be to document uncommon study spots. For example, if McGill libraries were crowded I could escape to Purvis Hall’s solarium, which was usually empty.

If I missed a piano, shoot me an email and I will add it to the map (mobile-friendly link):

Piano D: The Steinway in the lecture room in the basement of the chapel.
'Piano' a: Technically not a piano, the carillon is a unique instrument that's close enough!
Piano N: The Yamaha in the Mathey college common room.

Thanks to Jordan Ash for his camera-lending abilities.

Princeton Pianos was originally published by Jaan Lı at Jaan Lı on January 18, 2014.

https://jaan.io/princeton-pianos
How to apply to grad school

I gave an info session at McGill in March 2013 on applying to graduate schools in Canada, the U.S., and Britain. Here are some things I found most useful and what I wish I had known:

  • What is grad school? Matt Might might have the answer.
  • Decide where to apply: talk to professors at your school in the fields you are interested in and ask them what schools have the best research programs and people in those fields. If you have multiple interests or haven’t decided what you want to study, pick schools with research groups in a variety of fields.
  • Deadlines: write down every deadline of every school and fellowship you are considering; it is never too early to do this. If you are applying to start in September, there will be deadlines as early as September of the previous year, so aim to finish those applications in August of the previous year (e.g. the Rhodes, NSERC PGS, and Vanier deadlines are in September; Oxford’s first deadline is in October).
  • Requirements: figure out what you need to submit to the school to apply (viz. navigate the mazes of academic websites). A typical application consists of three letters of recommendation, a two-page statement of purpose (essay), CV, an application fee, and test scores (GRE, subject GRE, and TOEFL).
  • Apply to as many schools as you can - once you have one application done, additional applications don’t take much time. People typically apply to around ten schools (a few top schools, a few in the middle, and a few ‘safeties’ where admission is anticipated).
  • Do not worry about the cost of tests and applications - most programs will pay you a decent salary and you will readily make back what you spent (if you gain admission to just one program).
  • Apply to every scholarship and fellowship you are eligible for which will support your graduate studies. It is good practice writing the research statements and essays and you may even be successful.
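
The deadline tip above can be turned into a small planner. This is only a rough sketch: the application names and dates are placeholders (not real deadlines), and the 30-day lead time is an assumption standing in for "finish about a month early".

```python
from datetime import date, timedelta

# Placeholder entries for illustration only; look up the real dates yourself.
deadlines = {
    "Fellowship A": date(2013, 9, 15),
    "School B": date(2013, 10, 18),
    "School C": date(2013, 12, 15),
}

# Assumption: aim to finish each application roughly a month before it is due.
LEAD_TIME = timedelta(days=30)


def finish_by(deadlines, lead_time=LEAD_TIME):
    """Return (name, deadline, target finish date) tuples, earliest deadline first."""
    rows = [(name, due, due - lead_time) for name, due in deadlines.items()]
    return sorted(rows, key=lambda row: row[1])


for name, due, target in finish_by(deadlines):
    print(f"{name}: due {due.isoformat()}, aim to finish by {target.isoformat()}")
```

Sorting by deadline puts the earliest application at the top, which is usually the one that catches people off guard.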
The Application

Your application will be reviewed by a committee of faculty members (and sometimes senior graduate students). The majority of your graduate schooling consists of doing research – the most important thing you can do in your application is to demonstrate research ability. The best way to do this is to do summer research and work hard in the hopes of getting published and getting good recommendation letters attesting to your research potential. The next best way is to take research-based courses at your school.

Therefore, start doing research as early as possible. If you’re in high school, send me an email and I’ll try to give you a possible path to working in a lab. If you are in college:

  • Start talking to potential summer research supervisors in November of the previous year (in January of the same year at the latest). You can also apply to work at different schools - a bit harder, but it could happen through the same process. Look up professors you are interested in working with, and send them a short email with your LaTeXed CV (see below) asking about doing a summer project in the area of their research that appeals to you. Read one of their papers and mention it in your email (their website may not be current; look up their latest papers on PubMed, the arXiv, or the Web of Science).
  • Apply for an REU if you are a US citizen, an NSERC USRA if you are Canadian, or the Caltech SURF program if you are either or neither. The DAAD RISE program has internship opportunities in Germany. Jan Gorzny at UToronto has a helpful page on the NSERC USRA.
  • Apply for every summer research scholarship such as the REU, DAAD RISE, Caltech SURF, or NSERC USRA even if you do not have the most competitive application and transcript, as the professor you apply with may decide to fund you through a separate grant if your initial funding application is not successful. You may have to contact many professors before you find one willing to take you on - this is normal (I contacted around 30 faculty a year and the success rate was ~10%).
  • Take research courses, as electives or for your degree. For these you will also have to seek out professors to work with. At McGill such courses are the 396 research courses, and other possible routes are MATH 470 (Honours Research Project) or PHYS 459 (Honours Research Project or Thesis).
  • Before contacting professors, read Matt Might’s how to email post, use your official school email address, and Boomerang your emails to arrive at 3 PM on Wednesdays (see MailChimp’s email open rates summary).
  • If you cannot find a professor willing to take you on for the summer, consider volunteering in a lab for a few hours each week or take a research course to get your foot in the door. If you have LaTeXed your CV (see below), tried the above options, contacted a ton of professors, and have been unsuccessful in securing a summer or semester-long position, send me an email and I will do my best to tell you how to improve your application.
  • Persistence pays off with professors - if they don’t reply initially, show up at their office or send a follow-up email. You can also attend local colloquia or talks in fields that interest you; approach professors after their talk to ask about opportunities at their school, get their card or contact info, and follow up via email.
  • Once you’re working in research, do your best to see your project through from start to finish (this may mean putting in extra, unpaid time).
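
If you schedule emails by hand rather than with a tool like Boomerang, the send time from the tip above is easy to compute. A minimal sketch, assuming local time and a hypothetical helper name (`next_send_time`); the 3 PM default simply mirrors the suggestion in the list and can be changed.

```python
from datetime import datetime, timedelta

WEDNESDAY = 2  # datetime.weekday() counts Monday as 0


def next_send_time(now, weekday=WEDNESDAY, hour=15):
    """Return the first datetime strictly after `now` that falls on `weekday` at `hour`:00."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    if candidate <= now:
        # This week's slot has already passed, so use next week's.
        candidate += timedelta(days=7)
    return candidate


print(next_send_time(datetime.now()))
```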
Making your CV look good

When contacting professors with your CV, make sure it looks good - presentation makes a difference. Don’t use Microsoft Word. Don’t believe me that you should use LaTeX for your CV? Read this for an overview of the benefits.

To get your CV into LaTeX format, you can look online for CV templates - a good website is LaTeX Templates. Mike King also has a good intro to LaTeX.

Hosting your CV, setting up a website

Consider setting up a basic website with your CV and projects. You can do this with Google Sites or Wordpress (or Jekyll if you are comfortable with the command line).

At the very least, include a Dropbox link to your CV whenever you send it in an email. This way you can update your CV at any time and rest assured that the recipients will see the latest version.

Studying for the GRE and subject GRE

See here.

The Personal Statement

Read the guidelines for each school you are applying to - while typically they will ask you to elaborate on your research projects, courses, and future plans, some may ask about teaching or other specific things. If you are applying to many schools, you can use the same essay but change your ‘future plans’ section appropriately. Some tips:

  • Even if you are not sure of what field you are interested in, pick something that sounds interesting and fits your background and stick to it. If you are deciding between theory and experiment, pick experiment, as it is very difficult to get accepted for theory these days, and most end up switching to experiment anyway. The most convincing essay will typically be the one where you appear most sure of what you want to study.
  • You are not bound by your statement - you will typically do rotations in three to four groups before deciding on an advisor.
  • Write down the names of a few professors at the school you are writing the statement for; they will typically be the ones reviewing your case. Make sure your background matches their research program, and that you state which aspect of their research you are interested in.
  • Write as many concrete examples as you can of projects you did, or things that make you stand out. Keep any motivation as concise as possible (e.g. avoid the ubiquitous “Ever since grade school I knew I wanted to study [insert subject here].”)
  • Here are two sample essays:
Letters of Recommendation

Matt Might has good advice on this, as does Chris Blattman. Alex Maloney’s instructions are good and would apply to any professor you request letters of recommendation from.

Scholarships and Fellowships

Apply to every scholarship and fellowship for which you are eligible. For me, this included the Rhodes Scholarship (I strongly encourage you to apply: the interviews are nerve-wracking and great practice), Commonwealth Scholarship, NSERC PGSM, Vanier Canada Scholarship, Fulbright Scholarship, Mackenzie Scholarship (McGill link), Delta Upsilon Scholarship (McGill only), and Moyse Travelling Scholarship (McGill only). Lesser-known scholarships, which may have far fewer applicants, are offered through professional organizations such as the SPIE, IEEE, etc. (see below).

Attend conferences, try a semester abroad, join professional associations

If you have done research, make a poster and present it at a conference or meeting, regardless of whether you confirmed your hypothesis. Conferences typically have funding you can apply for, and your school may have funds like McGill’s Ambassador Fund to enable students to attend conferences. Additional sources of funding include professional associations, which typically provide free membership for undergraduates. Examples are the Canadian Association of Physicists, Institute of Physics (IOP), SPIE, IEEE, and Society of Women Engineers.

Contact professors before (and after) applying

Send emails to the professors you are interested in working with at the schools you are applying to, ideally well before actually applying. This serves several purposes: you can find out if they will be taking new students or not (this is important - if you only list faculty who are not taking students on your personal statement you are likely to be rejected). Furthermore, you will be able to include your CV if it is not asked for on the official application. To do this, read or skim their latest papers from their website, PubMed, the arXiv, or the ISI Web of Knowledge, and mention which areas of their research interest you. Read Matt Might’s how to email, use your official school email address, add a direct link to your hosted CV (see above), and Boomerang your emails to arrive at 9 AM on a Wednesday. Getting personal replies from profs makes a huge difference and can make the process feel much more personal (as well as being good motivation to grind through the months).

Your final-year grades don’t matter that much (and some good courses to take)

Caveat: they do matter if you plan on doing a Masters and then applying to PhD programs or possibly working in industry, and it’s never bad to have a high GPA.

However, between courses, GREs, and applications it is easy to get burnt out, so don’t sweat your grades if you get overwhelmed. Try to plan your courses to maximize your grades (GPA) for the first three years of your undergrad, as grad schools will not see the fall semester grades of your senior year (you apply in December and grades come out in January by which time you will already be hearing back). Alongside research courses, try doing your undergrad thesis in your third year even if it’s normally taken in your senior year (if you work hard you’ll have a publication to list on your application, a good letter of recommendation, and some A’s). Also consider taking scientific writing courses such as McGill’s CEAP 250 course - these will improve your writing and typically culminate in a final report which you could also submit for publication.

Further reading

Other webpages and resources I found useful:

What did I miss?

Shoot me an email with any recommendations or tips.

How to apply to grad school was originally published by Jaan Lı at Jaan Lı on September 01, 2013.

https://jaan.io/how-to-apply-to-grad-school
How to ace the GRE and Physics GRE

Most PhD programs in the US require the GRE general test. These scores matter, so start studying early and register at ETS to take the tests as early as possible.

  • For the general GRE test, I used and highly recommend The Princeton Review’s Cracking the GRE 2013 edition. It is excellent preparation and includes plenty of practice tests.
  • Don’t get an older version (e.g. the 2012 edition); get the 2014 edition instead.
  • ETS periodically adds new material to the tests so you want the most up-to-date book. You can take a computer-based or handwritten version of the test; I recommend the handwritten version as it is easier to make notes and work out the problems on the test booklet itself than on paper beside the computer.

The Physics GRE subject test is much more involved, and much more difficult. There are no omnipotent books from The Princeton Review, and you have to do the best you can using a variety of sources. Your Physics GRE score matters a lot if you are applying to Physics programs (especially if you are an international student, as some schools have cutoffs). Successful preparation in one sentence would be: do the 500 practice problems found online (links below) and understand the solutions and how to do them as fast as possible. In summary:

  • Your goal should not necessarily be to understand the material; it should be to ace this test in as little time as possible (not even ace – there is a heavy curve so an excellent score is typically 80 questions correct out of 100).
  • You have very little time per question, so once you’ve understood a question, check the solutions websites to see whether a comment gives a faster method of solving it, such as taking limits or using physical intuition. Memorize the formulas you see popping up again and again (see formula sheets below).
  • Focus all your effort on writing the practice tests and reviewing the questions. Do not rely heavily or spend much time (if any) on an official ‘How to prepare for the Physics GRE’ book – the material is too extensive to be condensed into this form. The 2012 Physics GRE was very similar to the 2008 test, so it pays to ensure you understand and can quickly do all 500 practice questions. This will take time so start early.
  • Preferably take the April test (register here) so you can write the October test if you need to (if you write the November test you will not have a chance to retake it, and if you take the October test you will not receive your scores by the November test).
  • The old tests from the 90s are much harder than the current versions, so if you are short on time focus more on the most recent practice tests. Here are links to the five practice tests found online, with online solutions. IMPORTANT: read the comments of the online solutions, as they frequently give ways of solving problems much faster.
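
To make "very little time per question" concrete, here is a rough pacing calculation. The 100-question count comes from the summary above; the 170-minute test length is an assumption on my part, so check the current format with ETS before relying on it. The function names are made up for this sketch.

```python
def seconds_per_question(total_minutes, n_questions):
    """Average time budget per question, in seconds."""
    return total_minutes * 60 / n_questions


def pacing_checkpoints(total_minutes, n_questions, every=25):
    """Map checkpoint question numbers to the minutes elapsed by which to reach them."""
    return {
        q: round(q * total_minutes / n_questions, 1)
        for q in range(every, n_questions + 1, every)
    }


TOTAL_MINUTES = 170  # assumption: verify the current test length with ETS
N_QUESTIONS = 100    # from the summary above

print(seconds_per_question(TOTAL_MINUTES, N_QUESTIONS))
print(pacing_checkpoints(TOTAL_MINUTES, N_QUESTIONS))
```

Under these assumptions the budget works out to well under two minutes per question, which is why the faster methods in the solution comments are worth learning.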

  • The awesome folks at Case Western Reserve University will send you free (FREE!) flash cards for the Physics GRE: all you have to do is send your address to physicsgreflashcards@phys.cwru.edu and they will put them in the mail (also have a look at their website).

  • Two former MIT students wrote a decent book attempting to summarize the material (I used a few chapters of this to review material).
  • Here are links to some formula sheets:
  • Additional ‘how to prepare for the Physics GRE’ pages:

Is anything in this article confusing, are links out of date, or can this be improved? Please submit a pull request, tweet me, or email me :)

How to ace the GRE and Physics GRE was originally published by Jaan Lı at Jaan Lı on August 31, 2013.

https://jaan.io/how-to-ace-the-gre-and-physics-gre
CANImmunize

I was lucky to be the first UI/UX designer for Canada’s national vaccinations app. Currently 140k+ users.

Featured on the CBC; won an award for “using wireless technology to improve the lives of Canadians”.

CANImmunize was originally published by Jaan Lı at Jaan Lı on August 15, 2013.

https://jaan.io/immunize-canada-app
Smoked salmon openfacer

This sandwich is an all-time classic hall-of-famer mainstay in my family. It’s a classy breakfast, lunch, or dinner, and is great for hosting (guests can make their own sandwiches). Just brush your teeth after consumption.

Ingredients
  • White bread with sesame seeds, well-toasted
  • Unsalted butter
  • Fresh smoked salmon, sliced [1]
  • Lemon
  • Capers or thinly sliced caperberries
  • Green onions (shallots or white onions are good substitutes)
  • Olive oil
  • Sea salt flakes
  • Freshly ground pepper
Instructions

Toast the bread. Finely dice the shallots, onions, or green onions.

Liberally butter the toasted bread, and cover the entirety of the toast’s surface with a single layer of smoked salmon. Sprinkle the sheet of salmon with lemon juice.

Evenly cover with the diced onions and capers, in about a 3:1 ratio of onions to capers.

Pour a decent amount (about 1 tablespoon) of olive oil onto the sandwich. Scatter sea salt flakes and finish it with some ground pepper.

This is easiest to eat with knife and fork due to the olive oil overdose. Serve with a glass of wine or whole milk, depending on time of day.

Thank you Mike King for the food photography, Joel Ryan for coining ‘openfacers’, and Frank Megna for the slogan! This post originally appeared at openfacers.com, the now-defunct food blog focused solely on open-faced sandwiches.

  1. Aim to buy fresh ‘chunk’ sushi-grade smoked salmon (ideally, never frozen) from your local fishmonger. Slice it with a sharp knife to your preferred thickness. This allows you to achieve thinner slices than the pre-sliced smoked salmon from normal grocery stores.

Smoked salmon openfacer was originally published by Jaan Lı at Jaan Lı on July 19, 2013.

https://jaan.io/smoked-salmon-open-faced-sandwich