Outline
The paper proposes Adversarial Inverse Graphics Networks (AIGNs) for weakly supervised learning:
- Combining feedback from rendering their predictions
- With distribution matching between their predictions and a collection of ground-truth factors
- And applying them to 3D tasks (human pose estimation, structure from motion) and facial image transformations (super resolution, inpainting)
Background
- Feed-forward models
- Unpaired supervision
- Tasks addressed in this paper
- 3D human pose estimation
- 3D structure and egomotion estimation
- Super resolution
- Inpainting
Related Work
- Inverse-graphics networks
- 3D interpreter networks
- Reconstruction loss without adversarial loss
- GAN
- Adversarial loss and L2 loss without reconstruction loss
- Annotated pairs supervision
Algorithm
Overview
The loss function of the network consists of two parts (a reconstruction loss and an adversarial loss):
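A sketch of the combined objective (the exact weighting is an assumption; $\lambda$ balances the two terms): the generator $G$ minimizes while the discriminator $D$ maximizes
$\min_G\max_D\;\mathcal{L}(G,D)=\mathcal{L}_{recon}(G)+\lambda\,\mathcal{L}_{adv}(G,D)$
where $\mathcal{L}_{recon}=\lVert P(G(x))-x\rVert$ measures how well the renderer $P$ reconstructs the input $x$ from the predicted factors $G(x)$, and $\mathcal{L}_{adv}$ matches the distribution of predictions to the unpaired ground-truth memory.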
3D Human Pose Estimation
Note: for 3D human pose estimation, we need to predict a 3D pose from a given 2D image. The generator predicts the 3D pose from 2D human-keypoint heatmaps, while the projection reconstructs the 2D heatmaps from the predicted 3D pose.
Each task uses a different projection. For this task, the projection follows standard camera geometry.
The predicted 3D pose is decomposed into a rotation and a linear combination of basis shapes:
$x_{3D}=R\sum_{j=1}^{|B|}\alpha_jB_j$
where $B_j$ is a basis of 3D human structure and $\alpha_j$ its coefficient; the paper selects 60 of 96 bases via PCA.
The projection is then computed as:
$x_{2D}^{proj}=P\,x_{3D}+\begin{bmatrix} c_x \\ c_y \\ 0 \end{bmatrix}$ where $P=\begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}$
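A minimal NumPy sketch of this projection (the function name and the inputs `alpha`, `B`, `R` are illustrative; in the paper the generator predicts the coefficients and camera parameters):

```python
import numpy as np

def project_pose(alpha, B, R, f, cx, cy):
    """Compose 3D keypoints from basis shapes, then project to 2D.

    alpha: (K,) basis coefficients (e.g. K = 60 PCA bases)
    B:     (K, 3, J) basis shapes for J body joints
    R:     (3, 3) rotation
    f, cx, cy: focal length and principal point
    """
    # x_3D = R * sum_j alpha_j B_j  -> (3, J)
    x3d = R @ np.tensordot(alpha, B, axes=1)
    # Homogeneous coordinates (4, J) so the 3x4 matrix P applies
    x3d_h = np.vstack([x3d, np.ones((1, x3d.shape[1]))])
    P = np.array([[f, 0, 0, 0],
                  [0, f, 0, 0],
                  [0, 0, 1, 0]], dtype=float)
    # x_2D^proj = P x_3D + [c_x, c_y, 0]^T, following the equation above
    return P @ x3d_h + np.array([[cx], [cy], [0.0]])
```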
Structure from Motion
Note: for structure from motion, we need to predict camera motion and depth from two given consecutive frames. The generator predicts the rotation $R$, translation $t$, and depth map, while the projection reconstructs the second frame from these predictions.
For this task, the projection is based on camera geometry and optical flow:
$\mathbf{X}_1^i=\begin{bmatrix} X_1^i \\ Y_1^i \\ Z_1^i \end{bmatrix}=\frac{d^i}{f}\begin{bmatrix} x_1^i-c_x \\ y_1^i-c_y \\ f \end{bmatrix}$
$\mathbf{X}_2^i=R\,\mathbf{X}_1^i+t$
where $(x_1^i, y_1^i)$ are the coordinates of pixel $i$ in the first frame, $d^i$ is its predicted depth, $f$ the focal length, $(c_x, c_y)$ the principal point, and $\mathbf{X}_1^i$, $\mathbf{X}_2^i$ the corresponding 3D point before and after the predicted camera motion $(R, t)$.
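A hedged NumPy sketch of this lift-and-transform step (all inputs, i.e. pixel coordinates, predicted depths `d`, motion `R`, `t`, and intrinsics, are assumed):

```python
import numpy as np

def lift_and_transform(x1, y1, d, R, t, f, cx, cy):
    """Backproject frame-1 pixels to 3D, then apply the predicted motion.

    x1, y1: (N,) pixel coordinates in frame 1
    d:      (N,) predicted depths
    R, t:   (3, 3) rotation and (3,) translation
    """
    # X_1 = (d / f) * [x - cx, y - cy, f]^T
    X1 = (d / f) * np.stack([x1 - cx, y1 - cy,
                             np.full_like(x1, f, dtype=float)])
    # X_2 = R X_1 + t  -> (3, N) points in the second camera's frame
    return R @ X1 + t[:, None]
```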
Then we can minimize the photometric reconstruction loss (together with the adversarial loss):
$L^{photo}=\frac{1}{wh}\sum_{x,y}||I_1(x,y)-I_2(x+U(x,y),y+V(x,y))||_1 $
where $(U,V)$ is the flow field obtained by projecting the transformed points $\mathbf{X}_2^i$ back onto the image plane. The loss relies on the brightness-constancy assumption of optical flow: pixel values change little between two consecutive frames.
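A minimal sketch of $L^{photo}$ using nearest-neighbor warping (real implementations use differentiable bilinear sampling; `I1`, `I2` are grayscale frames and `U`, `V` the flow fields):

```python
import numpy as np

def photometric_loss(I1, I2, U, V):
    """Mean L1 error between I1 and I2 warped by the flow (U, V)."""
    h, w = I1.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Sample I2 at (x + U, y + V), rounded and clamped to the image bounds
    xw = np.clip(np.rint(xs + U).astype(int), 0, w - 1)
    yw = np.clip(np.rint(ys + V).astype(int), 0, h - 1)
    return np.abs(I1 - I2[yw, xw]).mean()
```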
Super Resolution
Note: for super resolution, we need to predict a high-resolution image from a low-resolution one. The generator predicts the high-resolution image, while the projection (downsampling) reconstructs the low-resolution input.
For this task, paired data is easy to obtain, so the paper instead focuses on the bias in the memory; we can view this bias as a kind of condition. That is, if we constrain the memory, for example so that it contains only faces with big noses, the generated images are likely to 'possess' this feature, as sketched below.
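For example, a biased memory could be built by filtering the ground-truth collection on an attribute (a sketch; `images` and `big_nose` are hypothetical stand-ins for CelebA-style annotations):

```python
import numpy as np

def build_biased_memory(images, big_nose):
    """Keep only faces with the chosen attribute.

    images:   (N, H, W, 3) ground-truth face crops
    big_nose: (N,) binary attribute labels
    """
    # The discriminator only ever sees this filtered memory, so the
    # generator is pushed toward outputs that share the attribute.
    return images[big_nose.astype(bool)]
```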
Inpainting
Note: for image inpainting, we need to predict the complete image. The generator predicts the complete image, while the projection (re-applying the mask) reconstructs the masked input.
As with the previous task, paired data is easy to obtain, so the paper again focuses on the bias in the memory as a kind of condition: if the memory contains only faces with big noses and the mask covers the nose region, the generated image is likely to 'possess' this feature.
Experiment
1. 3D Human Pose Estimation
- Human3.6M dataset
- Compares three supervised methods:
- Forward2Dto3D
- 3D interpreter
- AIGN
2. Structure from Motion
- Dataset: Virtual KITTI (VKITTI)
- 50 high-resolution monocular videos (21,260 frames)
3. Image-to-image translation
- CelebA dataset
- 202,599 face images, 10,177 unique identities, and 40 binary attributes
Discussion
- Adversarial loss (makes the generated image look realistic)
- Image-to-image translation (why not generate after blurring the input?)
- Self-supervised learning on a validation set would be more reasonable than on the test set, as done in the paper
- About the decoder (a non-parametric renderer is better than a parametric one, since the latter's learning is hard to control)
Reference
- Hsiao-Yu Fish Tung, Adam W. Harley, William Seto, Katerina Fragkiadaki. Adversarial Inverse Graphics Networks: Learning 2D-to-3D Lifting and Image-to-Image Translation from Unpaired Supervision. ICCV 2017.