I mentioned in a previous post that I would take a look at variational inference, so here we go.

1 Basic maths

Variational inference (VI) is a method for approximate Bayesian inference. One might want to use VI instead of MCMC (which, up to Monte Carlo error, provides exact inference) when models become complex or datasets become large - as in these cases MCMC can take too long. In other words, VI is faster but not as accurate.

A good reference for VI is (in my opinion) Blei et al. (2018) which describes how a probabilistic model defines the joint distribution of observations \(x\) and latent variables \(z\):

\begin{equation*} p(z, x) = p(z) p(x | z) \end{equation*}

and that the problem in Bayesian inference is to compute the posterior \(p(z | x)\). Rather than using MCMC/sampling to compute this posterior, VI poses the problem as an optimisation one.
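
To make that concrete, here is a minimal sketch of such a joint density in Python - a toy conjugate model of my own choosing (standard normal prior on a scalar mean \(z\), \(\mathcal{N}(z, 1)\) observations), not one taken from the references:

```python
# A toy joint density p(z, x) = p(z) p(x | z): scalar latent mean z with a
# standard normal prior, and iid Normal(z, 1) observations x.
# Purely illustrative -- the model choice here is mine, not from the references.
import numpy as np
from scipy import stats

def log_joint(z, x):
    log_prior = stats.norm.logpdf(z, loc=0.0, scale=1.0)       # log p(z)
    log_lik = stats.norm.logpdf(x, loc=z, scale=1.0).sum()     # log p(x | z)
    return log_prior + log_lik                                 # log p(z, x)

x = np.array([1.8, 2.3, 2.1, 1.6])   # some made-up observations
print(log_joint(2.0, x))
```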

As described in Kucukelbir et al. (2016), consider a family of distributions \(q(z | \phi)\), parametrised by a vector \(\phi \in \Phi\), which approximate \(p(z | x)\). The idea of VI is then to find the best approximation of the posterior via:

\begin{equation*} \phi^* = \underset{\phi \in \Phi}{\operatorname{argmin}} \, KL(q(z | \phi) || p(z | x)) \end{equation*}

i.e. find the \(\phi^*\) which minimises the Kullback–Leibler (KL) divergence, an asymmetric measure of the difference between two probability distributions.
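
As a quick illustration (with Gaussians of my own choosing), the KL divergence between two univariate Gaussians has a closed form, and the same value can be estimated by Monte Carlo as \(\mathbb{E}_{q}(\log(q(z)) - \log(p(z)))\); swapping the arguments gives a different value, which is the asymmetry just mentioned:

```python
# Kullback-Leibler divergence between two univariate Gaussians, computed both
# from the closed form and by Monte Carlo as E_q[log q(z) - log p(z)].
# Toy numbers of my own; swapping the arguments shows the asymmetry.
import numpy as np
from scipy import stats

def kl_gauss(mu_q, sd_q, mu_p, sd_p):
    # closed form for KL(N(mu_q, sd_q^2) || N(mu_p, sd_p^2))
    return np.log(sd_p / sd_q) + (sd_q**2 + (mu_q - mu_p)**2) / (2 * sd_p**2) - 0.5

rng = np.random.default_rng(0)
z = rng.normal(0.0, 1.0, size=100_000)                   # samples from q = N(0, 1)
mc = np.mean(stats.norm.logpdf(z, 0.0, 1.0) - stats.norm.logpdf(z, 1.0, 2.0))

print(kl_gauss(0.0, 1.0, 1.0, 2.0), mc)                  # closed form vs Monte Carlo
print(kl_gauss(1.0, 2.0, 0.0, 1.0))                      # arguments swapped: different value
```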

Unfortunately, we can't really compute the KL divergence because it involves the posterior, and if we could compute that then we probably wouldn't be looking at sampling or approximate inference methods in the first place. Instead, the problem is changed to maximising what is known as the evidence lower bound (ELBO). Let's see if we can derive it starting from the KL divergence formula on Wikipedia:

\begin{align*} KL(q(z | \phi) || p(z | x)) &= \int_{-\infty}^{\infty} q(z | \phi) \log\left(\frac{q(z | \phi)}{p(z | x)}\right) dz \\ &= \int_{-\infty}^{\infty} q(z | \phi) \log\left(\frac{q(z | \phi) p(x)}{p(z, x)}\right) dz \\ &= \int_{-\infty}^{\infty} q(z | \phi) \left(\log(q(z | \phi)) - \log(p(z, x)) + \log(p(x))\right) dz \\ &= \int_{-\infty}^{\infty} q(z | \phi) \log(q(z | \phi)) dz - \int_{-\infty}^{\infty} q(z | \phi) \log(p(z, x)) dz + \int_{-\infty}^{\infty} q(z | \phi) \log(p(x)) dz \\ &= \int_{-\infty}^{\infty} q(z | \phi) \log(q(z | \phi)) dz - \int_{-\infty}^{\infty} q(z | \phi) \log(p(z, x)) dz + \log(p(x)) \int_{-\infty}^{\infty} q(z | \phi) dz \\ &= \mathbb{E}_{q(z | \phi)}(\log(q(z | \phi))) - \mathbb{E}_{q(z | \phi)}(\log(p(z, x))) + \log(p(x)) . \end{align*}

This shows the dependence on the evidence \(\log(p(x))\), which we can't compute; MCMC methods avoid it and VI needs to as well. Instead, rearrange the above to get:

\begin{align*} \log(p(x)) = KL(q(z | \phi) || p(z | x)) + \mathbb{E}_{q(z | \phi)}(\log(p(z, x))) - \mathbb{E}_{q(z | \phi)}(\log(q(z | \phi))) \end{align*}

and note that:

\begin{equation*} KL(q(z | \phi) || p(z | x)) \geq 0 \end{equation*}

by Gibbs' inequality, with equality (i.e. a divergence of 0) when \(q(z | \phi) = p(z | x)\). This means that if we replace the KL divergence term with 0, we obtain a lower bound on the evidence (the ELBO mentioned previously):

\begin{align*} \log(p(x)) \geq \mathbb{E}_{q(z | \phi)}(\log(p(z, x))) - \mathbb{E}_{q(z | \phi)}(\log(q(z | \phi))) = ELBO(\phi). \end{align*}
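
Since the ELBO only involves \(q(z | \phi)\) and the joint \(p(z, x)\), we can estimate it by Monte Carlo with draws from \(q\). Here is a sketch on the same toy conjugate model as above (my own example), where the evidence \(\log(p(x))\) happens to be tractable, so we can see the bound in action:

```python
# Monte Carlo estimate of ELBO(phi) = E_q[log p(z, x)] - E_q[log q(z | phi)]
# for a toy Normal-Normal model (standard normal prior on z, Normal(z, 1)
# likelihood), compared against the exact log evidence log p(x), which is
# tractable for this conjugate model. Model and numbers are my own illustration.
import numpy as np
from scipy import stats

x = np.array([1.8, 2.3, 2.1, 1.6])
n = len(x)

def elbo(mu, sd, n_samples=100_000, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.normal(mu, sd, size=n_samples)                             # z ~ q(z | phi)
    log_joint = (stats.norm.logpdf(z, 0.0, 1.0)                        # log p(z)
                 + stats.norm.logpdf(x[:, None], z, 1.0).sum(axis=0))  # log p(x | z)
    log_q = stats.norm.logpdf(z, mu, sd)                               # log q(z | phi)
    return np.mean(log_joint - log_q)

# Exact evidence: marginally x ~ N(0, I + 1 1^T) for this conjugate model.
log_evidence = stats.multivariate_normal.logpdf(x, mean=np.zeros(n),
                                                cov=np.eye(n) + np.ones((n, n)))

print(elbo(0.0, 1.0), "<=", log_evidence)                             # a rough q: strictly below
print(elbo(x.sum() / (n + 1), (n + 1) ** -0.5), "==", log_evidence)   # q = exact posterior: bound is tight
```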

As per Blei et al. (2018), we can look more closely at the characteristics of the ELBO via:

\begin{align*} ELBO(\phi) &= \mathbb{E}_{q(z | \phi)}(\log(p(z, x))) - \mathbb{E}_{q(z | \phi)}(\log(q(z | \phi))) \\ &= \mathbb{E}_{q(z | \phi)}(\log(p(x | z))) + \mathbb{E}_{q(z | \phi)}(\log(p(z))) - \mathbb{E}_{q(z | \phi)}(\log(q(z | \phi))) \\ &= \mathbb{E}_{q(z | \phi)}(\log(p(x | z))) - \left(\mathbb{E}_{q(z | \phi)}(\log(q(z | \phi))) - \mathbb{E}_{q(z | \phi)}(\log(p(z)))\right) \\ &= \mathbb{E}_{q(z | \phi)}(\log(p(x | z))) - KL(q(z | \phi) || p(z)) \end{align*}

thus, maximising the ELBO involves \(q(z | \phi)\) placing mass on values of \(z\) which give a high log likelihood \(\log(p(x | z))\), while not straying too far from the prior \(p(z)\).
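
A quick numerical sanity check (again on a toy Normal-Normal model of my own) that the two forms of the ELBO above agree:

```python
# Check that E_q[log p(z, x)] - E_q[log q(z | phi)] matches
# E_q[log p(x | z)] - KL(q || p(z)) on a toy Normal-Normal model (my example).
import numpy as np
from scipy import stats

x = np.array([1.8, 2.3, 2.1, 1.6])
mu, sd = 0.5, 0.8                                   # some arbitrary q(z | phi)

rng = np.random.default_rng(0)
z = rng.normal(mu, sd, size=200_000)                # z ~ q(z | phi)

log_lik = stats.norm.logpdf(x[:, None], z, 1.0).sum(axis=0)   # log p(x | z)
log_prior = stats.norm.logpdf(z, 0.0, 1.0)                    # log p(z)
log_q = stats.norm.logpdf(z, mu, sd)                          # log q(z | phi)

form1 = np.mean(log_lik + log_prior - log_q)
# KL between q = N(mu, sd^2) and the prior p(z) = N(0, 1), in closed form:
kl_q_prior = np.log(1.0 / sd) + (sd**2 + mu**2) / 2.0 - 0.5
form2 = np.mean(log_lik) - kl_q_prior

print(form1, form2)                                 # agree up to Monte Carlo error
```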

2 Variational families

Now we take a quick look at the two most popular (as far as I know) families of approximating distributions. The first is faster, while the second can give a better approximation.

2.1 Mean-field Gaussian

This family approximates the posterior with independent Gaussian factors:

\begin{equation*} q(z | \phi) = \prod_{i=1}^m \mathcal{N}(z_i | \mu_i, \sigma_i^2) \end{equation*}

and so \(\phi = (\mu_1, \ldots, \mu_m, \sigma_1^2, \ldots, \sigma_m^2)\).
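
In code this family is just a product of independent univariate Gaussians - a small illustrative sketch (the names and dimensions are mine):

```python
# A mean-field Gaussian family: the log density is a sum of independent
# univariate Gaussian log densities, one per latent dimension.
# Illustrative only; dimensions and values are made up.
import numpy as np
from scipy import stats

def log_q_meanfield(z, mu, sd):
    # z, mu, sd are length-m vectors; phi = (mu, sd)
    return stats.norm.logpdf(z, mu, sd).sum()

def sample_q_meanfield(mu, sd, rng):
    return rng.normal(mu, sd)           # one independent draw per dimension

rng = np.random.default_rng(0)
mu = np.array([0.0, 1.0, -2.0])
sd = np.array([1.0, 0.5, 2.0])
z = sample_q_meanfield(mu, sd, rng)
print(z, log_q_meanfield(z, mu, sd))
```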

2.2 Full-rank Gaussian

This family approximates the posterior with a multivariate Gaussian distribution:

\begin{equation*} q(z | \phi) = \mathcal{N}(z | \mu, \Sigma) \end{equation*}

with \(\phi = (\mu, \Sigma)\).
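
A small sketch of this family - here I parametrise \(\Sigma\) through a lower-triangular factor \(L\) with \(\Sigma = L L^T\), which (as in ADVI) keeps the covariance positive definite during optimisation; the numbers are illustrative only:

```python
# A full-rank Gaussian family, with the covariance parametrised through a
# lower-triangular Cholesky factor L so that Sigma = L L^T stays positive
# definite. Illustrative sketch; the numbers are made up.
import numpy as np
from scipy import stats

def log_q_fullrank(z, mu, L):
    return stats.multivariate_normal.logpdf(z, mean=mu, cov=L @ L.T)

def sample_q_fullrank(mu, L, rng):
    eps = rng.standard_normal(len(mu))
    return mu + L @ eps                     # reparametrised draw: z = mu + L eps

rng = np.random.default_rng(0)
mu = np.array([0.0, 1.0])
L = np.array([[1.0, 0.0],
              [0.5, 0.8]])                  # lower-triangular factor
z = sample_q_fullrank(mu, L, rng)
print(z, log_q_fullrank(z, mu, L))
```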

2.3 Recommendations

Taken from Kucukelbir et al. (2016):

How to choose between full-rank and mean-field ADVI? Scientists interested in posterior variances and covariances should use the full-rank approximation. Full-rank ADVI captures posterior correlations, in turn producing more accurate marginal variance estimates. For large data, however, full-rank ADVI can be prohibitively slow. Scientists interested in prediction should initially rely on the mean-field approximation. Mean-field ADVI offers a fast algorithm for approximating the posterior mean. In practice, accurate posterior mean estimates dominate predictive accuracy; underestimating marginal variances matters less.

where ADVI is automatic differentiation variational inference - a method which they discuss in their paper.

3 Conclusions

These were just a few of the basic derivations needed to get started with VI. The next step is maximising the ELBO - which is unfortunately not straightforward. Fortunately though, there are general implementations of methods for doing so available in stan, pyro, and tensorflow-probability (although I can't see much documentation about how to do it with tensorflow-probability, and found this issue saying there wasn't any).
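
As a taster, here is a minimal sketch of maximising the ELBO with pyro's stochastic variational inference machinery, assuming a reasonably recent pyro release - the toy model is my own and this isn't meant as a definitive recipe:

```python
# Minimal pyro sketch (assumes a recent pyro release): a toy model with a
# single latent mean, a mean-field Gaussian guide built by AutoNormal, and a
# loop that maximises the ELBO (pyro minimises its negative as the loss).
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO
from pyro.infer.autoguide import AutoNormal
from pyro.optim import Adam

def model(x):
    z = pyro.sample("z", dist.Normal(0.0, 1.0))           # prior p(z)
    with pyro.plate("data", len(x)):
        pyro.sample("obs", dist.Normal(z, 1.0), obs=x)     # likelihood p(x | z)

guide = AutoNormal(model)                                   # mean-field q(z | phi)
svi = SVI(model, guide, Adam({"lr": 0.01}), loss=Trace_ELBO())

x = torch.tensor([1.8, 2.3, 2.1, 1.6])
for step in range(2000):
    loss = svi.step(x)                                      # one stochastic ELBO step

print(guide.median())                                       # approximate posterior location of z
```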

I'm hoping to look over the details of these methods and implementations and see how they perform later.