Introduction

Room Layout Estimation from a monocular image, which aims to delineate a 2D boxy representation of an indoor scene, is an essential step for a wide variety of computer vision tasks.
Prior works break the problem into two sub-tasks: semantic segmentation of floor, walls, ceiling to produce layout hypotheses, followed by an iterative optimization step to rank these hypotheses.

(S. Dasgupta, K. Fang, K. Chen, and S. Savarese. Delay: Robust spatial layout estimation for cluttered indoor scenes. In CVPR, 2016.)

Recently, methods based on CNN have improved results, but these methods use CNNs to generate a new set of “low-level” features and fall short of exploiting the end-to-end learning ability of CNNs. In other words, the raw CNN predictions need to be post-processed by an expensive hypotheses testing stage to produce the final layout.

(S. Dasgupta, K. Fang, K. Chen, and S. Savarese. Delay: Robust spatial layout estimation for cluttered indoor scenes. In CVPR, 2016.)

This paper adopts a more direct formulation of this problem as one of estimating an ordered set of room layout keypoints. The work predicts the locations of the room layout keypoints using RoomNet, an end-to-end trainable encoder-decoder network.

Keypoint-based room layout representation

Figure shows a list of room types with their respective keypoint definition. These 11 room layouts cover most of the possible situations under typical camera poses and common cuboid representations under “Manhattan world assumption”. Once the trained model predicts correct keypoint locations with an associated room type, we can then simply connect these points in a specific order to produce boxy room layout representation.

RoomNet

Architecture of RoomNet

RoomNet is based on SegNet, a framework initially designed for segmentation.

(V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv:1511.00561, 2015.)

The base architecture of RoomNet adopts essentially the same convolutional encoder-decoder network as in SegNet. It takes an image of an indoor scene and directly outputs a set of 2D room layout keypoints to recover the room layout structure.

Each keypoint ground truth is represented by a 2D Gaussian heatmap centered at the true keypoint location as one of the channels in the output layer.

Extending to multiple room types

To extend to multiple room types, the number of channels are increased in the output layer to match the total number of keypoints for all 11 room types (total 48 keypoints for 11 room types) and the loss function is

The first term in the loss function compares the predicted heatmaps to ground-truth heatmaps synthesized for each keypoint separately. The ground truth for each keypoint heatmap is a 2D Gaussian centered on the true keypoint location with standard deviation of 5 pixels.

The second term in the loss function encourages the side head fully-connected layers to produce a high confidence value with respect to the correct room type class label.

(I have not fully understood this part yet.)

A memory augmented recurrent encoder-decoder (MRED) structure is introduced whose goal is to mimic the behavior of a typical recurrent neural network in order to refine the predicted keypoint heatmaps over “time” – the artificial time steps created by the recurrent structure.

Deep supervision through time

Supervision is injected at the end of each time step to make the models easier to train. The same loss function L is applied to all the time steps as demonstrated in Figure

Experiments

Accuracy

pixel error: pixel-wise error between the predicted surface labels and ground truth labels
keypoint error: average Euclidean distance between the predicted keypoint and annotated keypoint locations, normalized by the image diagonal length.

Qualitative results

The framework is also robust to key- point occlusion by objects (ex: tables, chairs, beds), demonstrated in Figure (b)(c)(d)(f). The major failure cases are when room layout boundaries are barely visible (Figure (a)(c)) or when there are more than one plausible room layout explanations for a given image of a scene (Figure (b)(d)).

Analyzing RoomNet

Recurrent vs direct prediction

Importance of deep supervision through time

Reference

Lee C Y, Badrinarayanan V, Malisiewicz T, et al. RoomNet: End-to-End Room Layout Estimation[J]. 2017.
S. Dasgupta, K. Fang, K. Chen, and S. Savarese. Delay: Robust spatial layout estimation for cluttered indoor scenes. In CVPR, 2016.

RoomNet End-to-End Room Layout Estimation

Room Layout Estimation