Adversarial Inverse Graphics Networks

AIGN

Posted by Tab on November 26, 2017

Outline

The paper proposes Adversarial Inverse Graphics Networks (AIGNs) for weakly supervised learning. The key ideas are:

  • a reconstruction loss obtained by rendering (projecting) the predictions back to the input domain,
  • an adversarial loss that matches the distribution of the predictions against an unpaired collection of ground-truth factors,
  • and applications to several 3D tasks and to facial image transformation.

Figure 1: Some examples of AIGN

Background

  • Feed-forward models
  • Unpaired supervision
  • About tasks included in this paper
    • 3D human pose estimation
    • 3D structure and egomotion estimation
    • Super resolution
    • Inpainting

Related Work

  • Inverse-graphics networks
    • 3D interpreter networks: reconstruction loss, but no adversarial loss
  • GANs
    • Adversarial loss and L2 loss, but no reconstruction loss
    • Supervision from annotated pairs

Algorithm

Overview

The loss function of the network consists of two parts: a reconstruction loss and an adversarial loss.
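A minimal sketch of the combined objective (the notation below is assumed for illustration, not copied verbatim from the paper):

$\min_G \max_D \; \mathcal{L}_{rec}(G)+\lambda\,\mathcal{L}_{adv}(G,D), \qquad \mathcal{L}_{rec}(G)=\lVert \mathrm{Proj}(G(x))-x\rVert_1$

where $\mathrm{Proj}$ is the task-specific projection (renderer) that maps the predictions back to the input domain, and $\lambda$ balances the two terms.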

3D Human Pose Estimation

Figure 2: Overview of AIGN for 3D Human Pose Estimation

Figure 3: Details of AIGN for 3D Human Pose Estimation

Note: for 3D human pose estimation, the task is to predict a 3D pose from a given 2D image. The generator predicts the 3D pose from 2D human-keypoint heatmaps, while the projection reconstructs the 2D heatmaps from the predicted 3D results.

Different tasks use different projection methods. For this task, the projection comes from standard camera geometry.

The predicted 3D results can be decomposed into some components:

$x_{3D}=R\sum_{j=1}^{|B|}\alpha_jB_j$

where $B_j$ is one of the 3D human-pose bases (the paper keeps 60 of the 96 bases obtained with PCA), $\alpha_j$ are the predicted coefficients, and $R$ is the predicted rotation.

Then the projection can be figured out by:

$x_{2D}^{proj}=P\,x_{3D}+\begin{bmatrix} c_{x}\\ c_{y}\\ 0 \end{bmatrix}$, where $P=\begin{bmatrix} f & 0 & 0 & 0\\ 0 & f & 0 & 0\\ 0 & 0 & 1 & 0 \end{bmatrix}$
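A minimal NumPy sketch of this decomposition and projection (array shapes, the use of homogeneous coordinates, and the intrinsics f, cx, cy are assumptions made for illustration):

```python
import numpy as np

def reconstruct_and_project(alpha, B, R, f=1.0, cx=0.0, cy=0.0):
    """Rebuild 3D keypoints from PCA bases and project them to 2D.

    alpha: (|B|,)     predicted basis coefficients
    B:     (|B|, K, 3) pose bases, K keypoints each
    R:     (3, 3)     predicted rotation
    """
    # x_3D = R * sum_j alpha_j B_j, applied to every keypoint
    x3d = np.einsum('j,jkc->kc', alpha, B) @ R.T                       # (K, 3)

    # Homogeneous coordinates, then x_2D^proj = P x_3D + [cx; cy; 0]
    x3d_h = np.concatenate([x3d, np.ones((x3d.shape[0], 1))], axis=1)  # (K, 4)
    P = np.array([[f, 0, 0, 0],
                  [0, f, 0, 0],
                  [0, 0, 1, 0]], dtype=float)
    x2d = x3d_h @ P.T + np.array([cx, cy, 0.0])                        # (K, 3)
    return x3d, x2d
```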

Structure from Motion

Figure 4: Overview of AIGN for 'structure from motion'

Figure 5: Details of AIGN for 'structure from motion'

Note: for structure from motion, the task is to predict camera motion and depth from two consecutive frames. The generator predicts the rotation R, the translation t, and the depth, while the projection reconstructs the frame at time t+1 from the predicted R, t, and depth.

For this task, the projection is based on rigid-motion geometry and optical flow:

$\mathbf{X}_1^i=\begin{bmatrix} X_1^i\\ Y_1^i\\ Z_1^i \end{bmatrix}=\frac{d^i}{f}\begin{bmatrix} x_1^i-c_x\\ y_1^i-c_y\\ f \end{bmatrix}$

$\mathbf{X}_2=R\,\mathbf{X}_1+T$

where $(x_1^i, y_1^i)$ are the pixel coordinates in the first frame, $d^i$ is the predicted depth, $f, c_x, c_y$ are the camera intrinsics, $\mathbf{X}_1^i$ is the corresponding 3D point, and $R, T$ are the predicted camera rotation and translation. Re-projecting $\mathbf{X}_2$ into the second frame gives the optical flow $(U, V)$.
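A minimal NumPy sketch of this back-projection and rigid transform (camera intrinsics and array shapes are assumptions made for illustration):

```python
import numpy as np

def pixels_to_flow(depth, R, T, f, cx, cy):
    """Back-project pixels of frame 1 to 3D, apply the predicted rigid
    motion (R, T), re-project into frame 2, and return the optical flow."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)

    # X_1 = (d / f) * [x - cx, y - cy, f]
    X1 = (depth / f)[..., None] * np.stack(
        [xs - cx, ys - cy, np.full_like(xs, f)], axis=-1)   # (h, w, 3)

    # X_2 = R X_1 + T
    X2 = X1 @ R.T + T

    # Perspective re-projection into frame 2
    x2 = f * X2[..., 0] / X2[..., 2] + cx
    y2 = f * X2[..., 1] / X2[..., 2] + cy

    # Optical flow (U, V) between the two frames
    return x2 - xs, y2 - ys
```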

Then we can minimize the reconstruction loss (as well as adversarial loss):

$L^{photo}=\frac{1}{wh}\sum_{x,y}||I_1(x,y)-I_2(x+U(x,y),y+V(x,y))||_1 $

which relies on the brightness-constancy assumption: pixel values stay approximately constant between two consecutive frames.
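A rough NumPy version of this photometric loss, using nearest-neighbour warping to keep the sketch short (the model itself would need a differentiable, e.g. bilinear, sampler):

```python
import numpy as np

def photometric_loss(I1, I2, U, V):
    """L1 photometric loss between frame 1 and frame 2 warped by the flow."""
    h, w = I1.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Sample I_2 at (x + U(x, y), y + V(x, y)), clamped to the image border
    xs2 = np.clip(np.round(xs + U).astype(int), 0, w - 1)
    ys2 = np.clip(np.round(ys + V).astype(int), 0, h - 1)
    # L^photo = 1/(wh) * sum_{x,y} || I_1(x, y) - I_2(x + U, y + V) ||_1
    return np.abs(I1 - I2[ys2, xs2]).sum() / (w * h)
```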

Super resolution

Figure 6: Overview of AIGN for super resolution

Figure 7: Details of AIGN for super resolution

Note: for super resolution, the task is to predict a high-resolution image. The generator predicts the high-resolution image, while the projection reconstructs the low-resolution input by downsampling.

For this task, paired data is easy to obtain, so the paper focuses instead on the bias introduced by the memory (the unpaired collection of target images the discriminator matches against), which can be seen as a kind of condition: if the memory is constrained to contain only photos with, say, big noses, the generated image is likely to 'inherit' that feature.
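A minimal sketch of the super-resolution rendering step, where the projection is simply downsampling back to the input resolution (the factor and the block-averaging scheme are assumptions for illustration):

```python
import numpy as np

def downsample_projection(hr, factor=4):
    """Render a predicted high-resolution image back to low resolution
    by block-averaging (a simple stand-in for the downsampling renderer)."""
    h, w, c = hr.shape
    return hr.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

def sr_reconstruction_loss(hr_pred, lr_input, factor=4):
    # L1 reconstruction loss between the re-rendered low-res image and the input
    return np.abs(downsample_projection(hr_pred, factor) - lr_input).mean()
```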

Inpainting

Figure 8: Overview of AIGN for inpainting

Figure 9: Details of AIGN for inpainting

Note: for image inpainting, the task is to predict the complete image. The generator predicts the complete image, while the projection reconstructs the masked input by re-applying the mask.

As in the previous task, paired data is easy to obtain, so the paper again focuses on the bias in the memory, viewed as a kind of condition. If the memory is constrained to contain only photos with big noses, and the mask covers the nose region, then the generated image is likely to 'inherit' that feature.
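A minimal sketch of the inpainting rendering step, where the projection simply re-applies the input mask (the mask convention is an assumption for illustration):

```python
import numpy as np

def mask_projection(full_pred, mask):
    """Render a predicted complete image back to the masked input domain.
    mask == 1 marks observed pixels, mask == 0 the hole to be filled."""
    return full_pred * mask

def inpainting_reconstruction_loss(full_pred, masked_input, mask):
    # L1 reconstruction loss; hole pixels contribute zero because
    # both sides are masked out there
    return np.abs(mask_projection(full_pred, mask) - masked_input).mean()
```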


Experiment

1. 3D Human Pose Estimation

  • Human3.6M dataset
  • Three methods are compared
    • Forward2Dto3D
    • 3D interpreter
    • AIGN

Figure 10: 3D reconstruction error on H3.6M for different methods

Figure 11: Samples of predicted 3D human poses

2. Structure from Motion

  • Dataset: Virtual KITTI (VKITTI)
  • 50 high-resolution monocular videos (21,260 frames)

Figure 12: Samples of predicted depth and optical flow

Figure 13: Loss curves for different tasks with different methods

Figure 14: Comparison of different methods for camera motion estimation

3. Image-to-image translation

  • CelebA dataset
  • 202,599 face images, 10,177 unique identities, and 40 binary attributes

Figure 15: Biased adversarial super-resolution

Figure 16: Biased adversarial inpainting

Discussion

  • Adversarial loss (so that the generated image looks realistic)
  • Image-to-image translation (why not generate after blurring the image?)
  • Self-supervised learning on a validation set would be more reasonable than on the test set, as done in the paper
  • About the decoder: a non-parametric renderer is better than a parametric one, because the learning of the latter is hard to control

Figure 17: Comparison between non-param and param render

Reference

  1. Adversarial Inverse Graphics Networks: Learning 2D-To-3D Lifting and Image-To-Image Translation From Unpaired Supervision