Why are we using the Reparameterization Trick in VAE?
July 07, 2022

In my recent research project, I ran into a super interesting problem: people use the reparameterization trick when implementing VAEs without ever explaining it. It confused me a lot because I had never learned it before, and nobody had mentioned it either. So I decided to search the internet, hoping to find an undergrad-level explanation.

Well, things did not go great. Most online answers focus on how the trick solves the problem rather than why it solves the problem. Or, put differently, they mostly walk through hard-core proofs, as Gregory Gundersen said in the blog. I also checked the original paper. It looked beautiful, but it did not help me at all.

After reading through Kingma's NIPS 2015 workshop slides, I realized that we need the reparameterization trick in order to backpropagate through a random node.
Intuitively, in its original form, VAEs sample from a random node $z$ which is approximated by the parametric model $q(z|\phi , x)$ of the true posterior. Backprop cannot flow through a random node.
Introducing a new parameter $\epsilon$ allows us to reparameterize $z$ in a way that allows backprop to flow through the deterministic nodes.
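
Written out for the Gaussian case, the claim looks roughly like this. The function $f$ is my own placeholder for whatever loss is computed from $z$ downstream (assumed differentiable and well-behaved), and I assume a one-dimensional $z$ with $q(z|\phi , x) = N(\mu, \sigma^2)$, where $\mu$ and $\sigma$ are computed from $x$ using $\phi$:

$$
z = \mu + \sigma\epsilon, \qquad \epsilon \sim N(0, 1)
$$

$$
\frac{\partial}{\partial\mu}\, \mathbb{E}_{z \sim N(\mu, \sigma^2)}\left[f(z)\right] = \mathbb{E}_{\epsilon \sim N(0, 1)}\left[f'(\mu + \sigma\epsilon)\right], \qquad
\frac{\partial}{\partial\sigma}\, \mathbb{E}_{z \sim N(\mu, \sigma^2)}\left[f(z)\right] = \mathbb{E}_{\epsilon \sim N(0, 1)}\left[f'(\mu + \sigma\epsilon)\,\epsilon\right]
$$

The randomness now lives in $\epsilon$, whose distribution does not depend on $\mu$ or $\sigma$, so the derivative can move inside the expectation and be estimated from samples.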

The slides' explanation covers the main reason we use the reparameterization trick, but I felt something was still missing: why is $z$ not usable in its original form? I know it is random, but why does it not become deterministic once it has been generated?

Eventually, I found this answer, which explains the reparameterization trick in a friendly way.

By default, a VAE (a neural network) uses a random variable $z$. Training requires backpropagation, and that is where the problem appears: $z$ cannot be used in backpropagation directly, because once $z$ has been sampled from $N(\mu,\sigma^2)$ it is just a number, and we cannot recover from it how it depends on $\mu$ and $\sigma$. To backpropagate with respect to $\mu$ and $\sigma$, we have to express the final $z$ in terms of the already-known $\mu$ and $\sigma$, and this is where the reparameterization trick comes in.
We now split $z$ into three parts, $\mu$, $\sigma$, and $\epsilon$, so that $z = \mu + \sigma\epsilon$, where $\epsilon \sim N(0, 1)$ is a random variable. Even though $\epsilon$ is still not deterministic, it does not depend on $\mu$ or $\sigma$, so we can differentiate $z$ with respect to both and never need to differentiate through $\epsilon$. Backpropagation becomes possible, and $\mu$ and $\sigma$ become outputs that the network produces with two parallel layers.
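
To convince myself, here is a minimal sketch in plain Python (no deep learning library). Everything in it is my own illustration rather than code from any VAE implementation: the names `sample_z`, `loss`, `reparam_grads`, `finite_diff_grads`, and `monte_carlo_grads` are made up, $z$ is a single scalar, and the toy loss $z^2$ stands in for whatever the decoder would compute.

```python
import random


def sample_z(mu, sigma, eps):
    """Reparameterized sample: a deterministic function of mu and sigma once eps is drawn."""
    return mu + sigma * eps


def loss(z):
    """Toy stand-in for the downstream loss (an assumption for illustration)."""
    return z ** 2


def reparam_grads(mu, sigma, eps):
    """Chain rule through z = mu + sigma * eps for loss(z) = z**2."""
    z = sample_z(mu, sigma, eps)
    dloss_dz = 2.0 * z
    # dz/dmu = 1 and dz/dsigma = eps; eps itself is treated as a constant.
    return dloss_dz * 1.0, dloss_dz * eps


def finite_diff_grads(mu, sigma, eps, h=1e-6):
    """Numerical check: nudge mu and sigma while holding the same eps fixed."""
    d_mu = (loss(sample_z(mu + h, sigma, eps)) - loss(sample_z(mu - h, sigma, eps))) / (2 * h)
    d_sigma = (loss(sample_z(mu, sigma + h, eps)) - loss(sample_z(mu, sigma - h, eps))) / (2 * h)
    return d_mu, d_sigma


def monte_carlo_grads(mu, sigma, n=100_000, seed=0):
    """Average the per-sample gradients over n independent draws of eps."""
    rng = random.Random(seed)
    total_mu, total_sigma = 0.0, 0.0
    for _ in range(n):
        g_mu, g_sigma = reparam_grads(mu, sigma, rng.gauss(0.0, 1.0))
        total_mu += g_mu
        total_sigma += g_sigma
    return total_mu / n, total_sigma / n


mu, sigma = 0.5, 1.5
eps = random.Random(0).gauss(0.0, 1.0)  # the only random step

analytic = reparam_grads(mu, sigma, eps)
numeric = finite_diff_grads(mu, sigma, eps)
print("single sample, chain rule:        ", analytic)
print("single sample, finite differences:", numeric)
print("Monte Carlo average:              ", monte_carlo_grads(mu, sigma))
```

For a single draw of $\epsilon$, the chain-rule gradients should match the finite-difference ones, which is only possible because $z$ is an ordinary function of $\mu$ and $\sigma$ once $\epsilon$ is fixed. For this particular toy loss, $\mathbb{E}[z^2] = \mu^2 + \sigma^2$, so the true gradients are $2\mu$ and $2\sigma$, and the Monte Carlo average should land close to those values.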