Introduction to Variational Autoencoders (VAEs)
Bayesian inference in complex models has always been a tough task, especially when there are also model parameters to be estimated as part of the model definition. Although the Expectation-Maximization (EM) algorithm, which alternates between optimizing a variational lower bound and the model parameters, provides a good framework, most of the time we cannot evaluate the exact posterior of the latent variables. In that case, the exact posterior is approximated either via sampling-based Monte Carlo methods or via optimization-based variational inference. The latter is faster but is mostly limited to models that are carefully designed around conditional conjugacies. This discouraged researchers from designing models parameterized with neural networks until VAEs were introduced.
For a more detailed explanation I suggest the reader read my Medium article here. Here I want to emphasize the most critical points and summarize the whole idea in a few words: assume a model with a latent variable $z$, an observed variable $x$, and a likelihood that is parameterized with neural network(s). A simple example of such a likelihood would be $p_{\theta}(x|z) = \mathcal{N}(x;\mu_{\theta}(z),1)$, where $\mu_{\theta}$ is a NN. Using a recognition distribution conditioned on the observation $x$, we can write the ELBO (Evidence Lower BOund) as follows:
\[L(\theta,\phi) = \mathbb{E}_{q_{\phi}(z|x)}\left[\log p_{\theta}(x|z)\right] - KL(q_{\phi}(z|x)\,||\,p(z))\]where $\phi$ denotes the variational parameters of the pre-defined distribution family $q(\cdot)$. By pre-defining the family of the recognition distribution, we transform a functional objective into an ordinary function of the parameters. Now, we can use another NN to parameterize the recognition distribution.
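To make the objective concrete, here is a minimal PyTorch sketch of a one-sample Monte Carlo estimate of this ELBO for the Gaussian example above, with a diagonal-Gaussian recognition distribution and a standard-normal prior $p(z)$. The names `encoder_mu`, `encoder_logvar` and `decoder_mu` are my own placeholders for the recognition and likelihood networks, and the sampling line already uses the reparameterization trick discussed below.

```python
import torch

def elbo(x, encoder_mu, encoder_logvar, decoder_mu):
    """One-sample Monte Carlo estimate of the ELBO for
    p_theta(x|z) = N(x; mu_theta(z), I) and q_phi(z|x) = N(z; mu_phi(x), diag(sigma_phi(x)^2))."""
    mu, logvar = encoder_mu(x), encoder_logvar(x)             # parameters of q_phi(z|x)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # z ~ q_phi(z|x), reparameterized

    # E_q[log p_theta(x|z)] with unit observation variance (up to an additive constant)
    recon = -0.5 * ((x - decoder_mu(z)) ** 2).sum(dim=-1)

    # KL(q_phi(z|x) || N(0, I)), available in closed form for diagonal Gaussians
    kl = -0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(dim=-1)

    return (recon - kl).mean()   # average over the mini-batch; this is the quantity to maximize
```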
Also notice that it is hard to evaluate both the ELBO and its gradient. We might consider resorting to noisy gradients within a stochastic optimization setting, but the first term prevents us from doing so, since the expectation is taken with respect to $q_{\phi}(z|x)$, which itself depends on the parameters $\phi$. Here the reparameterization trick comes to the rescue: the same expectation can be taken with respect to the distribution of an auxiliary (dummy) random variable that does not depend on $\phi$. Now we can write the gradient of the expectation as an expectation of a gradient, which renders stochastic optimization possible.
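As a small illustration of the trick (my own toy example, not from the article), the snippet below estimates the gradients of $\mathbb{E}_{z\sim\mathcal{N}(\mu,\sigma^2)}[z^2]$ by rewriting $z = \mu + \sigma\epsilon$ with $\epsilon\sim\mathcal{N}(0,1)$; since $\mathbb{E}[z^2] = \mu^2 + \sigma^2$, the true gradients are $2\mu$ and $2\sigma$.

```python
import torch

mu = torch.tensor(1.5, requires_grad=True)
sigma = torch.tensor(0.8, requires_grad=True)

# Reparameterize: z = mu + sigma * eps, with eps ~ N(0, 1) independent of (mu, sigma).
eps = torch.randn(100_000)
z = mu + sigma * eps

# Monte Carlo estimate of E[z^2]; gradients now flow through mu and sigma.
objective = (z ** 2).mean()
objective.backward()

print(mu.grad)     # close to 2 * mu    = 3.0
print(sigma.grad)  # close to 2 * sigma = 1.6
```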
VAEs have already led to many advances in machine learning by enabling variational inference in complex models that can also scale to big datasets. The parameters of a complex model can be learnt in an online manner thanks to the stochastic training setting, and the number of parameters to be estimated for inference scales with the number of weights in the recognition network, not with the size of the dataset. This means that the optimization is carried out in a lower-dimensional space. Moreover, inference for unseen instances can be made easily thanks to amortized inference: the only thing we need to do is run the recognition network forward for the new instance, and this is really fast since it is nothing but a series of matrix multiplications.
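As a rough sketch of what amortized inference looks like in code (reusing the placeholder networks `encoder_mu` and `encoder_logvar` from above), obtaining the approximate posterior for a fresh data point is just one forward pass, with no per-instance optimization loop:

```python
import torch

@torch.no_grad()   # no optimization needed at inference time
def infer_posterior(x_new, encoder_mu, encoder_logvar):
    """Amortized inference: q_phi(z|x_new) comes straight from the recognition network."""
    mu = encoder_mu(x_new)                         # posterior mean
    std = torch.exp(0.5 * encoder_logvar(x_new))   # posterior standard deviation
    return mu, std
```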