VAE Tutorial


Posted by Tab on October 8, 2017

Formulation

Maximum Likelihood — Find $\theta$ to maximize $P(X) = \int P(X|z; \theta) P(z)\, dz$, where $X$ is the data and $z$ is a latent variable with prior $P(z)$, which we can sample from.

Naive solution

Approximate $P(X)$ with samples of $z$: draw $z_1, \dots, z_n \sim P(z)$ and estimate $P(X) \approx \frac{1}{n} \sum_i P(X|z_i)$, as in the sketch below.
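
As a rough illustration, here is a minimal NumPy sketch of this naive estimator; the toy decoder $f$, the 2-D data point, and the Gaussian likelihood are illustrative assumptions, not part of the original post:

```python
import numpy as np

# Naive Monte Carlo estimate: P(X) ~= (1/n) * sum_i P(X|z_i), z_i ~ P(z).
# Toy setup: 2-D data, linear "decoder" f, Gaussian likelihood N(X; f(z), I).

def f(z):
    """Toy decoder mapping latent z to data space (illustrative)."""
    return 2.0 * z + 1.0

def p_x_given_z(x, z):
    """Isotropic Gaussian likelihood N(x; f(z), I)."""
    d = x.shape[-1]
    diff = x - f(z)
    return np.exp(-0.5 * np.sum(diff ** 2, axis=-1)) / (2 * np.pi) ** (d / 2)

rng = np.random.default_rng(0)
x = np.array([1.5, -0.5])               # one data point
z = rng.standard_normal((100_000, 2))   # samples from the prior P(z) = N(0, I)

likelihoods = p_x_given_z(x, z)
print("P(X) estimate:", likelihoods.mean())
# Most samples contribute almost nothing, which is why this is impractical:
print("fraction with P(X|z) < 1e-6:", (likelihoods < 1e-6).mean())
```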

Problems

  • Needs a huge number of samples of $z$, since for most of them $P(X|z) \approx 0$
  • Not practical computationally

Key question 1

Is it possible to know in advance which $z$ will give $P(X|z) \gg 0$? Solution: learn a distribution $Q(z)$ such that $z \sim Q(z)$ tends to give $P(X|z) \gg 0$.

Detail

Assume we can learn a distribution $Q(z)$ such that $z \sim Q(z)$ tends to give $P(X|z) \gg 0$. We want $P(X) = E_{z \sim P(z)}[P(X|z)]$, but that is not practical to compute; $E_{z \sim Q(z)}[P(X|z)]$ is. How do $E_{z \sim Q(z)}[P(X|z)]$ and $P(X)$ relate? Expanding the KL divergence $D[Q(z) \| P(z|X)]$ and applying Bayes' rule to $P(z|X)$, we finally get

$$\log P(X) - D[Q(z) \| P(z|X)] = E_{z \sim Q(z)}[\log P(X|z)] - D[Q(z) \| P(z)]$$

where the right-hand side is the variational lower bound. Note that:

  • KL divergence is always $\geq 0$ (checked numerically below).
  • $\log P(X) \geq \log P(X) - D[Q(z) \| P(z|X)]$, so the right-hand side above is a lower bound on $\log P(X)$.
  • Maximize the lower bound instead of $\log P(X)$ itself.
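
As a quick numerical sanity check on the first point, the following NumPy snippet (an illustrative addition, not from the original post) verifies that $D[Q \| P] \geq 0$ for random discrete distributions:

```python
import numpy as np

# Verify D[Q || P] >= 0 on random discrete distributions.
rng = np.random.default_rng(0)
for _ in range(5):
    q = rng.random(10); q /= q.sum()
    p = rng.random(10); p /= p.sum()
    kl = np.sum(q * np.log(q / p))
    assert kl >= 0.0
    print(f"D[Q || P] = {kl:.4f}")
```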

Key question 2

How do we get $Q(z)$?

  1. Use $Q(z|X)$ instead, conditioning on the observed $X$
  2. Model $Q(z|X)$ with a neural network
  3. Assume $Q(z|X)$ to be Gaussian, $N(\mu, c \cdot I)$
    • The neural network outputs the mean $\mu$ and the diagonal covariance matrix $c \cdot I$
    • Input: image; output: distribution parameters (see the sketch after this list)
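
Below is a minimal PyTorch sketch of such an encoder. The layer sizes, the flattened 784-dimensional input (e.g. MNIST), the 20-dimensional latent, and the choice of predicting a full diagonal log-variance (a common generalization of $c \cdot I$) are all illustrative assumptions:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps an image X to the parameters of Q(z|X) = N(mu, diag(sigma^2)).

    Predicting log(sigma^2) keeps the raw network output unconstrained
    while guaranteeing a positive variance.
    """
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)        # mean of Q(z|X)
        self.logvar = nn.Linear(h_dim, z_dim)    # log of diagonal covariance

    def forward(self, x):
        h = self.hidden(x)
        return self.mu(h), self.logvar(h)

# Input: image; output: distribution parameters.
x = torch.randn(8, 784)          # dummy batch of flattened images
mu, logvar = Encoder()(x)        # each of shape (8, 20)
```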

Loss

Convert the lower bound to a loss function

  1. Model $P(X|z)$ with a neural network, and let $f(z)$ be the network output.
  2. Assume $P(X|z)$ to be i.i.d. Gaussian
    • $X = f(z) + \epsilon$, where $\epsilon \sim N(0, I)$
    • The negative log-likelihood simplifies to an L2 loss: $\|X - f(z)\|^2$
  3. Assume $P(z) = N(0, I)$; then $D[Q(z|X) \| P(z)]$ has a closed-form solution (used in the sketch after this list)
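
Combining the two terms gives the loss below, a hedged PyTorch sketch; for a diagonal Gaussian $Q(z|X) = N(\mu, \text{diag}(\sigma^2))$ against the $N(0, I)$ prior, the closed form is $D = \frac{1}{2} \sum_k (\sigma_k^2 + \mu_k^2 - 1 - \log \sigma_k^2)$:

```python
import torch

def vae_loss(x, x_recon, mu, logvar):
    """Negative lower bound: L2 reconstruction + KL[Q(z|X) || N(0, I)]."""
    # -log P(X|z) up to constants, for the Gaussian decoder: ||X - f(z)||^2
    recon = torch.sum((x - x_recon) ** 2)
    # Closed-form KL of a diagonal Gaussian against the N(0, I) prior:
    # 0.5 * sum(sigma^2 + mu^2 - 1 - log(sigma^2))
    kl = 0.5 * torch.sum(logvar.exp() + mu ** 2 - 1.0 - logvar)
    return recon + kl
```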

Framework

Reparametrization Trick: enables backpropagation through the sampling of $z$, back to the encoder.

Sampling $z \sim N(\mu, \sigma^2)$ is equivalent to computing $z = \mu + \sigma \cdot \epsilon$, where $\epsilon \sim N(0, 1)$.
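
A minimal PyTorch sketch of the trick, assuming the encoder predicts $\log \sigma^2$ as in the earlier sketch:

```python
import torch

def reparametrize(mu, logvar):
    """Draw z ~ N(mu, diag(sigma^2)) as mu + sigma * eps, eps ~ N(0, I).

    The randomness lives entirely in eps, so the path from z back to
    (mu, logvar) is deterministic and gradients reach the encoder.
    """
    sigma = torch.exp(0.5 * logvar)   # sigma = exp(log(sigma^2) / 2)
    eps = torch.randn_like(sigma)     # external noise, no gradient needed
    return mu + sigma * eps
```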

Training

Repeat until convergence:

  1. $X^M$ ← random minibatch of $M$ examples from $X$
  2. $\epsilon$ ← sample of $M$ noise vectors from $N(0, I)$
  3. Compute the loss $L$ (i.e., run a forward pass through the neural network)
  4. Gradient descent on $L$ to update the Encoder and Decoder
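
The loop below is a hedged PyTorch sketch of this procedure, reusing the `Encoder`, `vae_loss`, and `reparametrize` sketches above; the decoder architecture, optimizer, learning rate, and stand-in data are illustrative assumptions:

```python
import torch

# Stand-in data and hyperparameters (illustrative).
dataloader = [torch.randn(32, 784) for _ in range(100)]  # minibatches X^M
num_epochs = 10

encoder = Encoder()
decoder = torch.nn.Sequential(                 # models P(X|z); output is f(z)
    torch.nn.Linear(20, 400), torch.nn.ReLU(), torch.nn.Linear(400, 784))
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

for epoch in range(num_epochs):                # repeat until convergence
    for x in dataloader:                       # X^M: minibatch of M examples
        mu, logvar = encoder(x)                # parameters of Q(z|X)
        z = reparametrize(mu, logvar)          # uses M noise vectors ~ N(0, I)
        loss = vae_loss(x, decoder(z), mu, logvar)  # forward pass, loss L
        opt.zero_grad()
        loss.backward()                        # gradients of L
        opt.step()                             # update Encoder and Decoder
```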

Testing

Sample $z \sim N(0,I)$ and pass it through the Decoder
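
A short sketch of this step, reusing the `decoder` from the training sketch above; the sample count and latent size are illustrative:

```python
import torch

# Generation at test time: no encoder involved, only the prior and decoder.
with torch.no_grad():
    z = torch.randn(16, 20)     # z ~ N(0, I): 16 samples, 20-dim latent
    samples = decoder(z)        # decoder from the training sketch above
```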