Outline
The paper proposes Adversarial Inverse Graphics Networks (AIGNs) for weakly supervised learning:
- Combining feedback from rendering their predictions
- With distribution matching between their predictions and a collection of ground-truth factors
- And applying them to 3D tasks (human pose estimation, structure from motion) and facial image transformations (super resolution, inpainting)
Background
- Feed-forward models
- Unpaired supervision
- Tasks addressed in this paper
- 3D human pose estimation
- 3D structure and egomotion estimation
- Super resolution
- Inpainting
Related Work
- Inverse-graphics networks
- 3D interpreter networks
- Reconstruction loss without adversarial loss
- GAN
- Adversarial loss and L2 loss without reconstruction loss
- Annotated pairs supervision
Algorithm
Overview
The loss function of the network consists of two parts (a reconstruction loss and an adversarial loss):
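A sketch of the combined objective (the exact weighting is an assumption; $\lambda$ balances the two terms): the generator $G$ minimizes while the discriminator $D$ maximizes
$\min_G\max_D\;\mathcal{L}(G,D)=\mathcal{L}_{recon}(G)+\lambda\,\mathcal{L}_{adv}(G,D)$
where $\mathcal{L}_{recon}=\lVert P(G(x))-x\rVert$ measures how well the renderer $P$ reconstructs the input $x$ from the predicted factors $G(x)$, and $\mathcal{L}_{adv}$ matches the distribution of predictions to the unpaired ground-truth memory.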
3D Human Pose Estimation
Note: for 3D human pose estimation, we need to predict a 3D pose from a given 2D image. The generator predicts the 3D pose from 2D human-keypoint heatmaps, while the projection reconstructs the 2D heatmaps from the predicted 3D pose.
Each task uses a different projection. For this task, the projection follows standard camera geometry.
The predicted 3D pose is decomposed into a rotation and a linear combination of basis shapes:
$x_{3D}=R\sum_{j=1}^{|B|}\alpha_jB_j$
where $B_j$ is a basis of 3D human structure and $\alpha_j$ its coefficient; the paper selects 60 of 96 bases via PCA.
The projection is then computed as:
$x_{2D}^{proj}=P\,x_{3D}+\begin{bmatrix} c_x \\ c_y \\ 0 \end{bmatrix}$ where $P=\begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}$
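A minimal NumPy sketch of this projection (the function name and the inputs `alpha`, `B`, `R` are illustrative; in the paper the generator predicts the coefficients and camera parameters):

```python
import numpy as np

def project_pose(alpha, B, R, f, cx, cy):
    """Compose 3D keypoints from basis shapes, then project to 2D.

    alpha: (K,) basis coefficients (e.g. K = 60 PCA bases)
    B:     (K, 3, J) basis shapes for J body joints
    R:     (3, 3) rotation
    f, cx, cy: focal length and principal point
    """
    # x_3D = R * sum_j alpha_j B_j  -> (3, J)
    x3d = R @ np.tensordot(alpha, B, axes=1)
    # Homogeneous coordinates (4, J) so the 3x4 matrix P applies
    x3d_h = np.vstack([x3d, np.ones((1, x3d.shape[1]))])
    P = np.array([[f, 0, 0, 0],
                  [0, f, 0, 0],
                  [0, 0, 1, 0]], dtype=float)
    # x_2D^proj = P x_3D + [c_x, c_y, 0]^T, following the equation above
    return P @ x3d_h + np.array([[cx], [cy], [0.0]])
```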
Structure from Motion
Note: for structure from motion, we need to predict camera motion and depth from two given consecutive frames. The generator predicts the rotation $R$, translation $t$, and depth map, while the projection reconstructs the second frame from these predictions.
For this task, the projection is based on camera geometry and optical flow:
$\mathbf{X}_1^i=\begin{bmatrix} X_1^i \\ Y_1^i \\ Z_1^i \end{bmatrix}=\frac{d^i}{f}\begin{bmatrix} x_1^i-c_x \\ y_1^i-c_y \\ f \end{bmatrix}$
$\mathbf{X}_2^i=R\,\mathbf{X}_1^i+t$
where $(x_1^i, y_1^i)$ are the coordinates of pixel $i$ in the first frame, $d^i$ is its predicted depth, $f$ the focal length, $(c_x, c_y)$ the principal point, and $\mathbf{X}_1^i$, $\mathbf{X}_2^i$ the corresponding 3D point before and after the predicted camera motion $(R, t)$.
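A hedged NumPy sketch of this lift-and-transform step (all inputs, i.e. pixel coordinates, predicted depths `d`, motion `R`, `t`, and intrinsics, are assumed):

```python
import numpy as np

def lift_and_transform(x1, y1, d, R, t, f, cx, cy):
    """Backproject frame-1 pixels to 3D, then apply the predicted motion.

    x1, y1: (N,) pixel coordinates in frame 1
    d:      (N,) predicted depths
    R, t:   (3, 3) rotation and (3,) translation
    """
    # X_1 = (d / f) * [x - cx, y - cy, f]^T
    X1 = (d / f) * np.stack([x1 - cx, y1 - cy,
                             np.full_like(x1, f, dtype=float)])
    # X_2 = R X_1 + t  -> (3, N) points in the second camera's frame
    return R @ X1 + t[:, None]
```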
Then we can minimize the photometric reconstruction loss (together with the adversarial loss):
$L^{photo}=\frac{1}{wh}\sum_{x,y}||I_1(x,y)-I_2(x+U(x,y),y+V(x,y))||_1 $
where $(U,V)$ is the flow field obtained by projecting the transformed points $\mathbf{X}_2^i$ back onto the image plane. The loss relies on the brightness-constancy assumption of optical flow: pixel values change little between two consecutive frames.
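A minimal sketch of $L^{photo}$ using nearest-neighbor warping (real implementations use differentiable bilinear sampling; `I1`, `I2` are grayscale frames and `U`, `V` the flow fields):

```python
import numpy as np

def photometric_loss(I1, I2, U, V):
    """Mean L1 error between I1 and I2 warped by the flow (U, V)."""
    h, w = I1.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Sample I2 at (x + U, y + V), rounded and clamped to the image bounds
    xw = np.clip(np.rint(xs + U).astype(int), 0, w - 1)
    yw = np.clip(np.rint(ys + V).astype(int), 0, h - 1)
    return np.abs(I1 - I2[yw, xw]).mean()
```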
Super Resolution
Note: for super resolution, we need to predict a high-resolution image from a low-resolution one. The generator predicts the high-resolution image, while the projection (downsampling) reconstructs the low-resolution input.
For this task, paired data is easy to obtain, so the paper instead focuses on the bias in the memory; we can view this bias as a kind of condition. That is, if we constrain the memory, for example so that it contains only faces with big noses, the generated images are likely to 'possess' this feature, as sketched below.
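For example, a biased memory could be built by filtering the ground-truth collection on an attribute (a sketch; `images` and `big_nose` are hypothetical stand-ins for CelebA-style annotations):

```python
import numpy as np

def build_biased_memory(images, big_nose):
    """Keep only faces with the chosen attribute.

    images:   (N, H, W, 3) ground-truth face crops
    big_nose: (N,) binary attribute labels
    """
    # The discriminator only ever sees this filtered memory, so the
    # generator is pushed toward outputs that share the attribute.
    return images[big_nose.astype(bool)]
```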
Inpainting
Note: for image inpainting, we need to predict the complete image. The generator predicts the complete image, while the projection (re-applying the mask) reconstructs the masked input.
As with the previous task, paired data is easy to obtain, so the paper again focuses on the bias in the memory as a kind of condition: if the memory contains only faces with big noses and the mask covers the nose region, the generated image is likely to 'possess' this feature.
Experiment
1. 3D Human Pose Estimation
- Human3.6M dataset
- Compares three supervised methods:
- Forward2Dto3D
- 3D interpreter
- AIGN
2. Structure from Motion
- Dataset: Virtual KITTI (VKITTI)
- 50 high-resolution monocular videos (21,260 frames)
3. Image-to-image translation
- CelebA dataset
- 202,599 face images, 10,177 unique identities, and 40 binary attributes
Discussion
- Adversarial loss (makes the generated image look realistic)
- Image-to-image translation (why not generate after blurring the input?)
- Self-supervised learning on a validation set would be more reasonable than on the test set, as done in the paper
- About the decoder (a non-parametric renderer is better than a parametric one, since the latter's learning is hard to control)
Reference
- Hsiao-Yu Fish Tung, Adam W. Harley, William Seto, Katerina Fragkiadaki. Adversarial Inverse Graphics Networks: Learning 2D-to-3D Lifting and Image-to-Image Translation from Unpaired Supervision. ICCV 2017.