Outline
Generative Model
- Density estimation
- Sample generation
Motivation
- High-dimensional probability distribution
- incorporated into reinforcement learning
- predict missing data (semi-supervised learning)
- multi-modal output
- realistic generation of samples
Figure 1: In this example, a model is trained to predict the next frame in a video sequence. Here 'multi-modal' means that there is no single correct answer. The second image, trained with MSE, averages over many slightly different plausible futures and is therefore blurry. With an additional GAN loss, the third image commits to one of the many possible outputs, so it is sharp, recognizable, and realistic.
Related Work
- FVBN: In parallel or not?
- ICA: Does the generator function have few restrictions?
- Boltzmann machine: Markov chains needed?
- VAE: Variational bound needed?
- Quality of samples?
Figure 2: An illustration of the alternative generative modeling methods
FVBN
Figure 3: Fully visible belief nets have a high generation cost, since samples are produced one dimension at a time and generation cannot be parallelized
ICA
Figure 4: ICA places too many restrictions on the generator function
VAE
Figure 5: VAE is not asymptotically consistent unless q is perfect
Algorithm
The GAN framework
Figure 6: Generator vs. discriminator
Cost function
The discriminator’s cost
$J^{(D)}(\theta^{(D)},\theta^{(G)})=-\cfrac{1}{2}E_{x\sim p_{data}}\log D(x)-\cfrac{1}{2}E_{z}\log(1-D(G(z)))$
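A minimal PyTorch sketch of this cost (the names `D`, `G`, `x_real`, and `z` are illustrative, and `D` is assumed to end in a sigmoid so it outputs probabilities):

```python
import torch

def discriminator_cost(D, G, x_real, z):
    """J^(D): cross-entropy cost for the discriminator.

    Real data should be scored as 1, generated samples as 0.
    D is assumed to output probabilities in (0, 1).
    """
    eps = 1e-7                        # numerical stability for the logs
    d_real = D(x_real)                # D(x) with x ~ p_data
    d_fake = D(G(z).detach())         # D(G(z)); detach so only D is updated here
    return (-0.5 * torch.log(d_real + eps).mean()
            - 0.5 * torch.log(1.0 - d_fake + eps).mean())
```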
The generator’s cost
- Minimax (JS divergence)
- Maximum likelihood game (KL divergence)
- Heuristic, non-saturating game
Minimax (JS divergence)
Figure 7: The generator's minimax cost: $J^{(G)}=-J^{(D)}$
Maximum likelihood game (KL Divergence)
$J^{(G)}=-\cfrac{1}{2}E_{z}\exp\left(\sigma^{-1}(D(G(z)))\right)$
Heuristic, non-saturating game
$J^{(G)}=-\cfrac{1}{2}E_{z}\log D(G(z))$
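Under the same assumptions as the sketch above (a probability-output `D`), the three generator costs differ only in how $D(G(z))$ enters the expectation; a hedged sketch:

```python
import torch

def generator_cost(D, G, z, variant="non_saturating"):
    """The three generator costs, written for a discriminator that outputs probabilities."""
    eps = 1e-7
    d_fake = D(G(z))
    if variant == "minimax":
        # J^(G) = -J^(D): minimize log(1 - D(G(z))); saturates when D(G(z)) is near 0
        return 0.5 * torch.log(1.0 - d_fake + eps).mean()
    if variant == "non_saturating":
        # heuristic game: maximize log D(G(z)) instead
        return -0.5 * torch.log(d_fake + eps).mean()
    if variant == "maximum_likelihood":
        # exp(sigma^{-1}(D)) = D / (1 - D), matching -1/2 E_z exp(sigma^{-1}(D(G(z))))
        return -0.5 * (d_fake / (1.0 - d_fake + eps)).mean()
    raise ValueError(f"unknown variant: {variant}")
```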
A significant problem
Non-convergence
This problem usually occurs when the discriminator becomes very strong, so the gradient of V vanishes: with the minimax generator cost, when D(G(z)) is near zero the gradient flowing back to the generator is also near zero, and the generator stops learning.
Figure 8: Convergence performance of the different methods
Figure 9: Convexity of the different methods
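A quick numeric check of the vanishing-gradient claim above, as a sketch assuming `D` applies a sigmoid to a logit $a$ (so $D(G(z))=\sigma(a)$) and using the gradient with respect to $a$ as a proxy for the gradient reaching the generator:

```python
import torch

# A confident discriminator: logit a = -5 gives D(G(z)) = sigmoid(-5) ≈ 0.0067
a = torch.tensor(-5.0, requires_grad=True)
minimax = 0.5 * torch.log(1.0 - torch.sigmoid(a))      # minimax generator cost term
(g_minimax,) = torch.autograd.grad(minimax, a)

a2 = torch.tensor(-5.0, requires_grad=True)
non_sat = -0.5 * torch.log(torch.sigmoid(a2))          # non-saturating cost
(g_non_sat,) = torch.autograd.grad(non_sat, a2)

print(g_minimax.item())   # ≈ -0.003: vanishes as D(G(z)) -> 0
print(g_non_sat.item())   # ≈ -0.497: stays large, so the generator keeps learning
```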
Mode collapse
Figure 10: Example of mode collapse, which can be mitigated by unrolled GANs
Figure 11: Different results depending on the order of p and q in the KL divergence: $D_{KL}(p_{data}\|p_{model})$ spreads the model over all modes, while $D_{KL}(p_{model}\|p_{data})$ concentrates it on a single mode
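To make the point of Figure 11 concrete, here is a tiny numeric sketch with made-up discrete distributions: the forward KL $D_{KL}(p_{data}\|p_{model})$ favors a model that covers every mode, while the reverse KL favors a model that sits on a single mode, which is what mode collapse looks like.

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence D_KL(p || q)."""
    return float(np.sum(p * np.log(p / q)))

# A bimodal "data" distribution p and two candidate models q over 3 bins
p       = np.array([0.49, 0.02, 0.49])   # two modes
q_cover = np.array([0.25, 0.50, 0.25])   # spreads mass so both modes are covered
q_seek  = np.array([0.96, 0.03, 0.01])   # concentrates on a single mode

print(kl(p, q_cover), kl(p, q_seek))     # ≈ 0.60 < 1.57: KL(p||q) prefers mode covering
print(kl(q_cover, p), kl(q_seek, p))     # ≈ 1.27 > 0.62: KL(q||p) prefers a single mode
```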
Experiment
Evaluation of generative models
Figure 12: Evaluation of generative models. The two plots correspond to an MLP generator and a DCGAN, and show that the loss value is not an accurate indicator of sample quality.
Discrete outputs
Frontier
- Semi-supervised learning
- Using the code (the latent code $z$)
Discussion
Tips and Tricks
Train with labels
- Generating samples conditioned on labels: $p(x \mid y)$
- Training with labels: $p(x, y)$
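As an illustrative sketch of label conditioning (the layer sizes and architecture here are assumptions, not the tutorial's), a conditional generator can simply take a one-hot label alongside the noise vector:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalGenerator(nn.Module):
    """Models p(x | y) by feeding the class label y into the generator together with z."""
    def __init__(self, z_dim=100, n_classes=10, x_dim=784):
        super().__init__()
        self.n_classes = n_classes
        self.net = nn.Sequential(
            nn.Linear(z_dim + n_classes, 256),
            nn.ReLU(),
            nn.Linear(256, x_dim),
            nn.Tanh(),
        )

    def forward(self, z, y):
        # condition on the label by concatenating a one-hot encoding of y to the noise z
        y_onehot = F.one_hot(y, num_classes=self.n_classes).float()
        return self.net(torch.cat([z, y_onehot], dim=1))
```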
One-sided label smoothing
(to reduce vulnerability to adversarial examples)
- Default discriminator cost
cross_entropy(1., discriminator(data))+cross_entropy(0.,discriminator(samples))
- One-sided label smooth cost
cross_entropy(.9, discriminator(data))+cross_entropy(0.,discriminator(samples))
- Do not smooth negative labels
$D^*(x)=\cfrac{(1-\alpha)p_{data}(x)+\beta p_{model}(x)}{p_{data}(x)+p_{model}(x)}$
According to the discriminator's loss, the optimal discriminator is the function $D^*(x)$ above. If we smooth only the positive targets (shrinking the coefficient of $p_{data}$ while keeping $\beta=0$), the effect is simply that the discriminator's confidence declines. With a positive $\beta$, however, $p_{model}$ enters the numerator, so in regions where $p_{model}$ is large and $p_{data}$ is small, $D^*(x)$ develops spurious peaks (local optima) that encourage the generator to keep producing those wrong samples. This is why negative labels are not smoothed.
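A minimal sketch of the smoothed cost in PyTorch, following the pseudocode above (the 0.9 target is the smoothed positive label; `d_real` and `d_fake` are assumed to be discriminator probabilities):

```python
import torch
import torch.nn.functional as F

def smoothed_discriminator_cost(d_real, d_fake, smooth=0.9):
    """One-sided label smoothing: soften only the real-data target.

    d_real, d_fake: discriminator outputs in (0, 1) for data and samples.
    The fake target stays at exactly 0 -- negative labels are not smoothed.
    """
    real_loss = F.binary_cross_entropy(d_real, torch.full_like(d_real, smooth))
    fake_loss = F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    return real_loss + fake_loss
```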
Virtual Batch Normalization
Figure 13: Correlation between samples within a batch caused by batch normalization
- Batch Norm (causes correlation between the samples within a single batch)
- Reference Norm (fix a reference batch R and use its statistics for normalization)
- Virtual Batch Norm (avoids overfitting to the reference batch that plain Reference Norm suffers from)
Figure 14: Steps of Virtual Batch Normalization
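A rough sketch of the idea, assuming 2-D tensors of shape (batch, features) and omitting the learnable scale and shift that a real implementation would add:

```python
import torch

def virtual_batch_norm(x, ref_batch, eps=1e-5):
    """Normalize each example with statistics of the fixed reference batch plus
    that example alone, so examples in the current batch do not affect each other."""
    R = ref_batch.shape[0]
    ref_mean = ref_batch.mean(dim=0)            # statistics of the reference batch
    ref_sq_mean = (ref_batch ** 2).mean(dim=0)
    w = 1.0 / (R + 1)                           # weight of the single new example
    mean = w * x + (1.0 - w) * ref_mean         # per-example "virtual batch" mean
    sq_mean = w * x ** 2 + (1.0 - w) * ref_sq_mean
    var = sq_mean - mean ** 2                   # E[x^2] - E[x]^2 over the virtual batch
    return (x - mean) / torch.sqrt(var + eps)
```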
Balance G and D?
- D usually performs better than G (consistent with the theory)
- D's network is typically bigger and deeper
- Do not restrict D when it becomes too accurate