Multi-view Supervision for Single-view Reconstruction
via Differentiable Ray Consistency

University of California, Berkeley

In CVPR, 2017

Left: We measure the consistency between a (predicted) 3D shape and an observation image using a differentiable ray consistency formulation. Right: Sample results obtained by applying our framework on the ShapeNet dataset using multiple color images as supervision for training. We show the input RGB image and visualize the 3D shape predicted by our multi-view supervised CNN from two novel views.

We study the notion of consistency between a 3D shape and a 2D observation and propose a differentiable formulation which allows computing gradients of the 3D shape given an observation from an arbitrary view. We do so by reformulating view consistency using a differentiable ray consistency (DRC) term. We show that this formulation can be incorporated in a learning framework to leverage different types of multi-view observations (e.g., foreground masks, depth, color images, semantics) as supervision for learning single-view 3D prediction. We present empirical analysis of our technique in a controlled setting. We also show that this approach allows us to improve over existing techniques for single-view reconstruction of objects from the PASCAL VOC dataset.
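As a rough illustration of what a per-ray consistency term can look like, the sketch below treats a single ray passing through a row of voxels with predicted occupancy probabilities. It computes the probability that the ray stops at each voxel (or escapes the grid), then takes the expected cost of those events under an observation. This is a minimal sketch under stated assumptions, not the released implementation: the function names, the event-probability form, and the mask cost values are illustrative choices and do not come from this page.

```python
from typing import List, Sequence


def ray_event_probs(occupancy: Sequence[float]) -> List[float]:
    """Probability that a ray stops at each voxel, plus a final 'escape' event.

    occupancy[i] is the predicted probability that the i-th voxel along the
    ray is occupied. The ray stops at voxel i only if all earlier voxels
    are empty (assumed event model, for illustration).
    """
    probs = []
    free = 1.0  # probability that every voxel seen so far is empty
    for x in occupancy:
        probs.append(free * x)
        free *= 1.0 - x
    probs.append(free)  # the ray passes through every voxel
    return probs


def ray_consistency_loss(occupancy: Sequence[float],
                         event_costs: Sequence[float]) -> float:
    """Expected cost of the ray's termination event given an observation."""
    probs = ray_event_probs(occupancy)
    if len(event_costs) != len(probs):
        raise ValueError("need one cost per voxel plus one for the escape event")
    return sum(p * c for p, c in zip(probs, event_costs))


def mask_costs(num_voxels: int, is_foreground: bool) -> List[float]:
    """Hypothetical costs for a foreground-mask observation.

    A foreground pixel penalizes the ray escaping the grid; a background
    pixel penalizes the ray stopping at any voxel.
    """
    if is_foreground:
        return [0.0] * num_voxels + [1.0]
    return [1.0] * num_voxels + [0.0]


occupancy = [0.5, 0.5]
loss_fg = ray_consistency_loss(occupancy, mask_costs(2, True))   # 0.25
loss_bg = ray_consistency_loss(occupancy, mask_costs(2, False))  # 0.75
```

Because the loss is built only from products and sums of the occupancy values, it is differentiable with respect to them, so an automatic differentiation framework could propagate gradients from an observed pixel back to the predicted shape. Other observation types would change only the per-event costs, for example a depth observation could assign each event the absolute difference between the voxel's depth along the ray and the observed depth.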

Paper

Tulsiani, Zhou, Efros, Malik.

Multi-view Supervision for Single-view Reconstruction
via Differentiable Ray Consistency.

In CVPR, 2017. (Oral)

[pdf]

[Bibtex]

Code

[GitHub]

Applications

Multi-view Mask/Depth Supervision (ShapeNet)
Sample results on ShapeNet dataset using multiple depth images as supervision for training. a) Input image. b,c) Predicted 3D shape.

Single-view Mask Supervision (PASCAL VOC)
Sample results on PASCAL VOC dataset using pose and foreground masks as supervision for training. a) Input image. b,c) Predicted 3D shape.

Driving Sequences as Multi-view Supervision (Cityscapes)
Sample results on Cityscapes dataset using multiple depth and semantic segmentation images seen in driving sequences as supervision for training. a) Input image. b,c) Predicted 3D shape visualized by rendering inferred depth and semantics under simulated forward motion.

Multi-view Color Images as Supervision (ShapeNet)
Sample results on ShapeNet dataset using multiple RGB images as supervision for training. a) Input image. b,c) Predicted 3D shape.

Acknowledgements

We thank the anonymous reviewers for helpful comments. This work was supported in part by Intel/NSF VEC award IIS-1539099, NSF Award IIS-1212798, and the Berkeley Fellowship to ST. We gratefully acknowledge NVIDIA corporation for the donation of Tesla GPUs used for this research.