Jensen's inequality is a powerful tool often used in mathematical derivations and analyses. It states that for a convex function $f$ and an arbitrary random variable $X$ with finite expectation, we have the following upper bound on $f\left(\E X\right)$:
$$
f\left(\E X\right)
\le
\E f\left(X\right)
$$
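A quick numerical sanity check of the inequality, as a minimal sketch. The helper name `jensen_gap` and the choice of Gaussian samples are assumptions made for illustration only. The empirical distribution of any finite sample is itself a valid distribution, so the gap $\E f(X) - f(\E X)$ computed on the sample should be non-negative for convex $f$.

```python
import math
import random


def jensen_gap(f, samples):
    """Return mean(f(x)) - f(mean(x)) over the given samples.

    For a convex f this gap is non-negative (Jensen's inequality applied
    to the empirical distribution of the samples).
    """
    n = len(samples)
    mean_x = sum(samples) / n
    mean_fx = sum(f(x) for x in samples) / n
    return mean_fx - f(mean_x)


random.seed(0)  # fixed seed so the run is reproducible
xs = [random.gauss(0.0, 1.0) for _ in range(10_000)]

gap_square = jensen_gap(lambda x: x * x, xs)  # f(x) = x^2 is convex
gap_exp = jensen_gap(math.exp, xs)            # f(x) = exp(x) is convex
print(gap_square, gap_exp)
```

For $f(x) = x^2$ the gap is exactly the (biased) sample variance, which is one way to see why it cannot be negative.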
Deep RL is hot these days. It's one of the most popular topics in the submissions at NeurIPS / ICLR / ICML and other ML conferences. And while the definition of RL is pretty general, in this note I'd argue that the famous REINFORCE algorithm alone is not enough to label your …
While doing some of my research I have been looking at the derivation of $f$-GANs, and found an easier way to derive their lower bound, without invoking convex conjugate functions.
$f$-GANs are a generalization of standard GANs to arbitrary $f$-divergences. Given a convex function $f$, $f$-divergence, in turn, can …
In this post I'd like to show how Self-Normalized Importance Sampling (IWHVI and IWAE) and Annealed Importance Sampling can be used to give (sometimes sandwich) bounds on the MI in many different cases.
Mutual Information (MI) is an important concept from Information Theory that captures the idea of information …
This post sets the background for the upcoming post on my work on more efficient use of neural samplers for Variational Inference.
Variational Inference
At the core of Bayesian Inference lies the well-known Bayes' theorem, relating our prior beliefs $p(z)$ with those obtained after observing some data $x$:
Unfortunately, these methods don't work for discrete random variables. Moreover, it looks like there's no way to backpropagate through discrete stochastic nodes, as …
Last year I covered some modern Variational Inference theory. These methods are often used in conjunction with Deep Neural Networks to form deep generative models (VAE, for example) or to enrich deterministic models with stochastic control, which leads to better exploration. Or you might be interested in amortized inference.
The more I talk to people online, the more I hear about the famous No Free Lunch Theorem (NFL theorem). Unfortunately, quite often people don't really understand what the theorem is about, and what its implications are. In this post I'd like to share my view on the NFL theorem …
Many tasks of machine learning can be posed as optimization problems. One comes up with a parametric model, defines a loss function, and then minimizes it in order to learn optimal parameters. One very powerful tool of optimization theory is the use of smooth (differentiable) functions: those that can be …
Previously we covered Variational Autoencoders (VAE), a popular inference tool based on neural networks. In this post we'll consider a follow-up work from Toronto by Y. Burda, R. Grosse and R. Salakhutdinov: Importance Weighted Autoencoders (IWAE). The crucial contribution of this work is the introduction of a new lower bound on the marginal …
So far we have had little that is "neural" in our VI methods. Now it's time to fix that, as we're going to consider Variational Autoencoders (VAE), a paper by D. Kingma and M. Welling, which made a lot of buzz in the ML community. It has 2 main contributions: a new …
In the previous post we covered Stochastic VI: an efficient and scalable variational inference method for exponential family models. However, there are many more distributions than those belonging to the exponential family. Inference in these cases requires a significant amount of model analysis. In this post we consider Black Box Variational Inference …
In the previous post I covered well-established classical theory developed in the early 2000s. Since then technology has made huge progress: now we have much more data, and a great need to process it and process it fast. In the big data era we have huge datasets, and cannot afford too …
As a member of the Bayesian methods research group I'm heavily interested in the Bayesian approach to machine learning. One of the strengths of this approach is the ability to work with hidden (unobserved) variables which are interpretable. This power, however, comes at the cost of generally intractable exact inference, which limits the …
While working on my machine learning project I needed to perform some quite computation-heavy calculations several times, each time with slightly different inputs. These calculations were CPU and memory bound, so just spawning them all at once would only slow down the overall running time because of increased amount …
It's well known that the lower bound for the sorting problem (in the general case) is
$\Omega(n \log n)$. The proof I was taught is somewhat involved and is
based on paths in "decision" trees. Recently I've discovered an
information-theoretic approach (or reformulation) to that proof.
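
The post's own argument is not reproduced in this excerpt, so the following is only a sketch of the standard counting argument, under the assumption that the sort is comparison-based: each comparison yields at most one bit, and identifying one of $n!$ possible orderings takes at least $\log_2 n!$ bits. The function name `min_comparisons` is made up for illustration.

```python
import math


def min_comparisons(n):
    """Information-theoretic lower bound on worst-case comparisons.

    A comparison answers one yes/no question (at most 1 bit), and there
    are n! orderings to tell apart, so at least ceil(log2(n!))
    comparisons are required in the worst case.
    """
    return math.ceil(math.log2(math.factorial(n)))


for n in (3, 5, 16, 1024):
    # Compare the exact bound with the n * log2(n) growth rate.
    print(n, min_comparisons(n), n * math.log2(n))
```

Since $\log_2 n! \ge \frac{n}{2} \log_2 \frac{n}{2}$, the bound grows as $\Omega(n \log n)$.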
Once upon a time I was asked (well, actually the question wasn't only for me, but for the whole habrahabr community) whether it is possible to implement namespaced methods in JavaScript for built-in types, like:
5..rubish.times(function() { // this function will be called 5 times
console.log("Hi there!");
});
"some string …
Recently I've read an article, Efficient Memoization using Partial Function Application. The author explains function memoization using partial application. When I was reading the article, I thought "Hmmm, can I come up with a more general solution?" And as suggested in the comments, one can use variadic templates to achieve it. So …
Some time ago, when Facebook open-sourced their Folly library, I was reading their docs and found something interesting. In the section "Memory Handling" they state:
In fact it can be mathematically proven that a growth factor of 2 is rigorously the worst possible because it never allows the vector to reuse any …
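
The quoted claim can be explored with a small idealized model, sketched here under loud assumptions that do not come from the Folly docs: all previously freed blocks are adjacent and coalesce into a single hole, and the current block must stay alive while its contents are copied to the new one. The helper `first_reuse_step` is hypothetical.

```python
from fractions import Fraction


def first_reuse_step(growth, max_steps=64):
    """Return the first reallocation whose new block fits into the hole
    left by all previously freed blocks, or None if that never happens
    within max_steps.

    Idealized model: block sizes are 1, g, g^2, ...; the current block
    cannot be reused because it is still being copied from.
    """
    sizes = [Fraction(1)]
    for step in range(1, max_steps):
        new_size = sizes[-1] * growth
        freed = sum(sizes[:-1])  # every block except the live one
        if freed >= new_size:
            return step
        sizes.append(new_size)
    return None


print(first_reuse_step(Fraction(2)))     # growth factor 2
print(first_reuse_step(Fraction(3, 2)))  # growth factor 1.5
```

With a factor of 2 the freed blocks sum to $2^{k} - 1$ while the next request is $2^{k+1}$, so the hole is always too small. In this model a factor of 1.5 makes reuse possible after a few reallocations.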