Layer-structured 3D Scene Inference
via View Synthesis

Shubham Tulsiani (UC Berkeley)
Richard Tucker (Google)
Noah Snavely (Google)
In ECCV, 2018

In this work we go beyond 2.5D shape representations and learn to predict, from a single image, layered scene representations that capture the scene more completely, including hidden objects. On the right, we show our method's predicted 2-layer texture and shape for the highlighted region: (a, b) show the predicted textures for the foreground and background layers respectively, and (c, d) depict the corresponding predicted inverse depth. Note how both predict structure behind the tree, such as the continuation of the building.

We present an approach to infer a layer-structured 3D representation of a scene from a single input image. This allows us to infer not only the depth of the visible pixels, but also the texture and depth of content in the scene that is not directly visible. We overcome the challenge posed by the lack of direct supervision by instead leveraging a more naturally available multi-view supervisory signal. Our insight is to use view synthesis as a proxy task: we enforce that our representation, inferred from a single image, matches the true observation when rendered from a novel viewpoint. We present a learning framework that operationalizes this insight using a new, differentiable novel-view renderer. We provide qualitative and quantitative validation of our approach in two different settings, and demonstrate that we can learn to capture the hidden aspects of a scene.
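To make the view-synthesis proxy task concrete, here is a minimal numpy sketch of the core idea: each layer of the predicted representation (per-pixel texture and inverse depth) is back-projected to 3D, reprojected into a novel camera, and composited by depth; the rendered image is then compared against the true observation. This is only an illustrative sketch, not the paper's implementation: the function names and camera conventions here are assumptions, and the paper's renderer is differentiable (soft splatting), whereas this uses hard z-buffering for simplicity.

```python
import numpy as np

def render_ldi(textures, inv_depths, K, R, t, hw):
    """Render a layered representation (per-layer texture + inverse depth,
    predicted in a source camera) into a novel view.

    Hypothetical sketch: points from every layer are forward-splatted into
    the target camera (intrinsics K, relative rotation R, translation t),
    and the nearest surface per target pixel wins (hard z-buffer)."""
    H, W = hw
    out = np.zeros((H, W, 3))
    zbuf = np.full((H, W), np.inf)
    K_inv = np.linalg.inv(K)
    ys, xs = np.mgrid[0:H, 0:W]
    # Homogeneous pixel coordinates of the source view, shape 3 x N.
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T
    for tex, invd in zip(textures, inv_depths):
        depth = 1.0 / invd.reshape(-1)        # inverse depth -> depth
        pts = K_inv @ pix * depth             # back-project to 3D (source frame)
        pts = R @ pts + t[:, None]            # transform into target frame
        proj = K @ pts
        u = np.round(proj[0] / proj[2]).astype(int)
        v = np.round(proj[1] / proj[2]).astype(int)
        z = proj[2]
        valid = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        cols = tex.reshape(-1, 3)
        for i in np.flatnonzero(valid):       # per-point z-buffer update
            if z[i] < zbuf[v[i], u[i]]:
                zbuf[v[i], u[i]] = z[i]
                out[v[i], u[i]] = cols[i]
    return out

def view_synthesis_loss(rendered, target):
    """Photometric (L1) loss between rendered and observed novel views."""
    return np.abs(rendered - target).mean()
```

With an identity relative pose, rendering a single layer should reproduce its own texture and give zero loss, which makes the z-buffered splatting easy to sanity-check. The key property motivating the paper's design is that a differentiable version of this renderer lets the view-synthesis loss supervise the layer predictions, including content hidden in the source view.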


Tulsiani, Tucker, Snavely.

Layer-structured 3D Scene Inference
via View Synthesis.

In ECCV, 2018.

[pdf]     [Bibtex]




We thank Tinghui Zhou and John Flynn for helpful discussions and comments. This work was done while ST was an intern at Google. This webpage template was borrowed from some colorful folks.

Related Concurrent Work: A recent paper by Dhamo et al. pursues a similar layered scene representation, with some interesting differences in the setup.