# Shubham Tulsiani

 I am an Assistant Professor at Carnegie Mellon University in the Robotics Institute. I am interested in building perception systems that can infer the spatial and physical structure of the world they observe. Please see these recent talks for an overview. Prior to joining CMU, I was a Research Scientist at FAIR, Pittsburgh working with Abhinav Gupta. I previously graduated from UC, Berkeley where I was advised by Jitendra Malik, and also frequently collaborated with Alyosha Efros.  shubhtuls AT cmu.edu 

## Teaching

 16-889: Learning for 3D Vision. Spring 2022

## Publications (selected | all)

2022

[New] Pre-train, Self-train, Distill: A simple recipe for Supersizing 3D Reconstruction
Kalyan Vasudev Alwala, Abhinav Gupta, Shubham Tulsiani
CVPR, 2022
pdf   project page   abstract   bibtex   code

Our work learns a unified model for single-view 3D reconstruction of objects from hundreds of semantic categories. As a scalable alternative to direct 3D supervision, our work relies on segmented image collections for learning 3D of generic categories. Unlike prior works that use similar supervision but learn independent category-specific models from scratch, our approach of learning a unified model simplifies the training process while also allowing the model to benefit from the common structure across categories. Using image collections from standard recognition datasets, we show that our approach allows learning 3D inference for over 150 object categories. We evaluate using two datasets and qualitatively and quantitatively show that our unified reconstruction approach improves over prior category-specific reconstruction baselines. Our final 3D reconstruction model is also capable of zero-shot inference on images from unseen object categories and we empirically show that increasing the number of training categories improves the reconstruction quality.

@inProceedings{vasudev2022ss3d,
title={Pre-train, Self-train, Distill: A simple recipe for Supersizing 3D Reconstruction},
author={Vasudev, Kalyan Alwala and  Gupta, Abhinav and Tulsiani, Shubham},
year={2022},
booktitle={Computer Vision and Pattern Recognition (CVPR)}
}

[New] What's in your hands? 3D Reconstruction of Generic Objects in Hands
Yufei Ye, Abhinav Gupta, Shubham Tulsiani
CVPR, 2022
pdf   project page   abstract   bibtex   code

Our work aims to reconstruct hand-held objects given a single RGB image. In contrast to prior works that typically assume known 3D templates and reduce the problem to 3D pose estimation, our work reconstructs generic hand-held object without knowing their 3D templates. Our key insight is that hand articulation is highly predictive of the object shape, and we propose an approach that conditionally reconstructs the object based on the articulation and the visual input. Given an image depicting a hand-held object, we first use off-the-shelf systems to estimate the underlying hand pose and then infer the object shape in a normalized hand-centric coordinate frame. We parameterized the object by signed distance which are inferred by an implicit network which leverages the information from both visual feature and articulation-aware coordinates to process a query point. We perform experiments across three datasets and show that our method consistently outperforms baselines and is able to reconstruct a diverse set of objects. We analyze the benefits and robustness of explicit articulation conditioning and also show that this allows the hand pose estimation to further improve in test-time optimization.

@inProceedings{ye2022ihoi,
title={What's in your hands? 3D Reconstruction of Generic Objects in Hands},
author={Ye, Yufei and  Gupta, Abhinav and Tulsiani, Shubham},
year={2022},
booktitle={Computer Vision and Pattern Recognition (CVPR)}
}

[New] AutoSDF: Shape Priors for 3D Completion, Reconstruction and Generation
Paritosh Mittal*, Yen-Chi Cheng*, Maneesh Singh, Shubham Tulsiani
CVPR, 2022
pdf   project page   abstract   bibtex   code

Powerful priors allow us to perform inference with insufficient information. In this paper, we propose an autoregressive prior for 3D shapes to solve multimodal 3D tasks such as shape completion, reconstruction, and generation. We model the distribution over 3D shapes as a non-sequential autoregressive distribution over a discretized, low-dimensional, symbolic grid-like latent representation of 3D shapes. This enables us to represent distributions over 3D shapes conditioned on information from an arbitrary set of spatially anchored query locations and thus perform shape completion in such arbitrary settings (e.g., generating a complete chair given only a view of the back leg). We also show that the learned autoregressive prior can be leveraged for conditional tasks such as single-view reconstruction and language-based generation. This is achieved by learning task-specific naive conditionals which can be approximated by light-weight models trained on minimal paired data. We validate the effectiveness of the proposed method using both quantitative and qualitative evaluation and show that the proposed method outperforms the specialized state-of-the-art methods trained for individual tasks.

@inproceedings{autosdf2022,
title={{AutoSDF}: Shape Priors for 3D Completion, Reconstruction and Generation},
author={Mittal, Paritosh and Cheng, Yen-Chi and Singh, Maneesh and Tulsiani, Shubham},
booktitle={CVPR},
year={2022}
}

2021

[New] NeRS: Neural Reflectance Surfaces for Sparse-view 3D Reconstruction in the Wild
Jason Y. Zhang, Gengshan Yang, Shubham Tulsiani*, and Deva Ramanan*
NeurIPS, 2021
pdf   project page   abstract   bibtex   video   code

Recent history has seen a tremendous growth of work exploring implicit representations of geometry and radiance, popularized through Neural Radiance Fields (NeRF). Such works are fundamentally based on a (implicit) volumetric representation of occupancy, allowing them to model diverse scene structure including translucent objects and atmospheric obscurants. But because the vast majority of real-world scenes are composed of well-defined surfaces, we introduce a surface analog of such implicit models called Neural Reflectance Surfaces (NeRS). NeRS learns a neural shape representation of a closed surface that is diffeomorphic to a sphere, guaranteeing water-tight reconstructions. Even more importantly, surface parameterizations allow NeRS to learn (neural) bidirectional surface reflectance functions (BRDFs) that factorize view-dependent appearance into environmental illumination, diffuse color (albedo), and specular "shininess." Finally, rather than illustrating our results on synthetic scenes or controlled in-the-lab capture, we assemble a novel dataset of multiview images from online marketplaces for selling goods. Such "in-the-wild" multiview image sets pose a number of challenges, including a small number of views with unknown/rough camera estimates. We demonstrate that surface-based neural reconstructions enable learning from such data, outperforming volumetric neural rendering-based reconstructions. We hope that NeRS serves as a first step toward building scalable, high-quality libraries of real-world shape, materials, and illumination.

@inproceedings{zhang2021ners,
title={{NeRS}: Neural Reflectance Surfaces for Sparse-view 3D Reconstruction in the Wild},
author={Zhang, Jason Y. and Yang, Gengshan and Tulsiani, Shubham and Ramanan, Deva},
booktitle={Conference on Neural Information Processing Systems},
year={2021}
}

No RL, No Simulation: Learning to Navigate without Navigating
Meera Hahn, Devendra Chaplot, Shubham Tulsiani, Mustafa Mukadam, James M. Rehg, Abhinav Gupta
NeurIPS, 2021
pdf   project page   abstract   bibtex   code

Most prior methods for learning navigation policies require access to simulation environments, as they need online policy interaction and rely on ground-truth maps for rewards. However, building simulators is expensive (requires manual effort for each and every scene) and creates challenges in transferring learned policies to robotic platforms in the real-world, due to the sim-to-real domain gap. In this paper, we pose a simple question: Do we really need active interaction, ground-truth maps or even reinforcement-learning (RL) in order to solve the image-goal navigation task? We propose a self-supervised approach to learn to navigate from only passive videos of roaming. Our approach, No RL, No Simulator (NRNS), is simple and scalable, yet highly effective. NRNS outperforms RL-based formulations by a significant margin. We present NRNS as a strong baseline for any future image-based navigation tasks that use RL or Simulation.

@article{hahn2021nrns,
title={No RL, No Simulation: Learning to Navigate without Navigating},
author={Meera Hahn and Devendra Chaplot and Shubham Tulsiani and Mustafa Mukadam and James Rehg and Abhinav Gupta},
booktitle={Neural Information Processing Systems},
year={2021},
}

A Differentiable Recipe for Learning Visual Non-Prehensile Planar Manipulation
Bernardo Aceituno, Alberto Rodriguez, Shubham Tulsiani, Abhinav Gupta, Mustafa Mukadam
CoRL, 2021
pdf   abstract   bibtex

Specifying tasks with videos is a powerful technique towards acquiring novel and general robot skills. However, reasoning over mechanics and dexterous interactions can make it challenging to scale visual learning for contact-rich manipulation. In this work, we focus on the problem of visual dexterous planar manipulation: given a video of an object in planar motion, find contact-aware robot actions that reproduce the same object motion. We propose a novel learning architecture that combines video decoding neural models with priors from contact mechanics by leveraging differentiable optimization and differentiable simulation. Through extensive simulated experiments, we investigate the interplay between traditional model-based techniques and modern deep learning approaches. We find that our modular and fully differentiable architecture outperforms learning-only methods on unseen objects and motions.

@inProceedings{aceituno21planar,
title={A Differentiable Recipe for Learning Visual Non-Prehensile Planar Manipulation},
author={Aceituno, Bernardo and Rodriguez, Alberto and Tulsiani, Shubham and Gupta, Abhinav and Mukadam, Mustafa},
year={2021},
booktitle={Conference on Robot Learning (CoRL)}
}

Where2Act: From Pixels to Actions for Articulated 3D Objects
Kaichun Mo, Leonidas J. Guibas, Mustafa Mukadam, Abhinav Gupta, Shubham Tulsiani
ICCV, 2021
pdf   project page   abstract   bibtex   code

One of the fundamental goals of visual perception is to allow agents to meaningfully interact with their environment. In this paper, we take a step towards that long-term goal -- we extract highly localized actionable information related to elementary actions such as pushing or pulling for articulated objects with movable parts. For example, given a drawer, our network predicts that applying a pulling force on the handle opens the drawer. We propose, discuss, and evaluate novel network architectures that given image and depth data, predict the set of actions possible at each pixel, and the regions over articulated parts that are likely to move under the force. We propose a learning-from-interaction framework with an online data sampling strategy that allows us to train the network in simulation (SAPIEN) and generalizes across categories. But more importantly, our learned models even transfer to real-world data.

@inProceedings{mo2021where2act,
title={Where2Act: From Pixels to Actions for Articulated 3D Objects},
author={Mo, Kaichun and Guibas, Leonidas and Mukadam, Mustafa and Gupta, Abhinav and Tulsiani, Shubham},
year={2021},
booktitle={International Conference on Computer Vision (ICCV)}
}

PixelTransformer: Sample Conditioned Signal Generation
Shubham Tulsiani, Abhinav Gupta
ICML, 2021
pdf   project page   abstract   bibtex   code

We propose a generative model that can infer a distribution for the underlying spatial signal conditioned on sparse samples e.g. plausible images given a few observed pixels. In contrast to sequential autoregressive generative models, our model allows conditioning on arbitrary samples and can answer distributional queries for any location. We empirically validate our approach across three image datasets and show that we learn to generate diverse and meaningful samples, with the distribution variance reducing given more observed pixels. We also show that our approach is applicable beyond images and can allow generating other types of spatial outputs e.g. polynomials, 3D shapes, and videos.

@inProceedings{tulsiani2021pixel,
title={PixelTransformer: Sample Conditioned Signal Generation},
author={Tulsiani, Shubham and  Gupta, Abhinav},
year={2021},
booktitle={International Conference on Machine Learning (ICML)}
}

Shelf-Supervised Mesh Prediction in the Wild
Yufei Ye, Shubham Tulsiani, Abhinav Gupta
CVPR, 2021
pdf   project page   abstract   bibtex   code

We aim to infer 3D shape and pose of object from a single image and propose a learning-based approach that can train from unstructured image collections, supervised by only segmentation outputs from off-the-shelf recognition systems (i.e. 'shelf-supervised'). We first infer a volumetric representation in a canonical frame, along with the camera pose. We enforce the representation geometrically consistent with both appearance and masks, and also that the synthesized novel views are indistinguishable from image collections. The coarse volumetric prediction is then converted to a mesh-based representation, which is further refined in the predicted camera frame. These two steps allow both shape-pose factorization from image collections and per-instance reconstruction in finer details. We examine the method on both synthetic and real-world datasets and demonstrate its scalability on 50 categories in the wild, an order of magnitude more classes than existing works.

@inProceedings{ye2021shelf,
title={Shelf-Supervised Mesh Prediction in the Wild},
author={Ye, Yufei and Tulsiani, Shubham and  Gupta, Abhinav},
year={2021},
booktitle={Computer Vision and Pattern Recognition (CVPR)}
}

2020

See, Hear, Explore: Curiosity via Audio-Visual Association
Victoria Dean, Shubham Tulsiani, Abhinav Gupta
NeurIPS, 2020
pdf   project page   abstract   bibtex   video   code

Exploration is one of the core challenges in reinforcement learning. A common formulation of curiosity-driven exploration uses the difference between the real future and the future predicted by a learned model. However, predicting the future is an inherently difficult task which can be ill-posed in the face of stochasticity. In this paper, we introduce an alternative form of curiosity that rewards novel associations between different senses. Our approach exploits multiple modalities to provide a stronger signal for more efficient exploration. Our method is inspired by the fact that, for humans, both sight and sound play a critical role in exploration. We present results on several Atari environments and Habitat (a photorealistic navigation simulator), showing the benefits of using an audio-visual association model for intrinsically guiding learning agents in the absence of external rewards.

@article{dean2020see,
title={See, hear, explore: Curiosity via audio-visual association},
author={Dean, Victoria and Tulsiani, Shubham and Gupta, Abhinav},
journal={Advances in Neural Information Processing Systems},
volume={33},
year={2020}
}

Sarah Young, Dhiraj Gandhi, Shubham Tulsiani, Abhinav Gupta, Pieter Abbeel, Lerrel Pinto
CORL, 2020
pdf   project page   abstract   bibtex   video   code

Visual imitation learning provides a framework for learning complex manipulation behaviors by leveraging human demonstrations. However, current interfaces for imitation such as kinesthetic teaching or teleoperation prohibitively restrict our ability to efficiently collect large-scale data in the wild. Obtaining such diverse demonstration data is paramount for the generalization of learned skills to novel scenarios. In this work, we present an alternate interface for imitation that simplifies the data collection process while allowing for easy transfer to robots. We use commercially available reacher-grabber assistive tools both as a data collection device and as the robot’s end-effector. To extract action information from these visual demonstrations, we use off-the-shelf Structure from Motion (SfM) techniques in addition to training a finger detection network. We experimentally evaluate on two challenging tasks: non-prehensile pushing and prehensile stacking, with 1000 diverse demonstrations for each task. For both tasks, we use standard behavior cloning to learn executable policies from the previously collected offline demonstrations. To improve learning performance, we employ a variety of data augmentations and provide an extensive analysis of its effects. Finally, we demonstrate the utility of our interface by evaluating on real robotic scenarios with previously unseen objects and achieve a 87% success rate on pushing and a 62% success rate on stacking.

@inProceedings{young2020vime,
author={Young, Sarah and Gandhi, Dhiraj and Tulsiani, Shubham and Gupta, Abhinav and Abbeel, Pieter and Pinto, Lerrel},
year={2020},
booktitle={Conference on Robot Learning (CORL)}
}

Articulation-aware Canonical Surface Mapping
Nilesh Kulkarni, Abhinav Gupta, David Fouhey, Shubham Tulsiani
CVPR, 2020
pdf   project page   abstract   bibtex   video   code

We tackle the tasks of: 1) predicting a Canonical Surface Mapping (CSM) that indicates the mapping from 2D pixels to corresponding points on a canonical template shape , and 2) inferring the articulation and pose of the template corresponding to the input image. While previous approaches rely on leveraging keypoint supervision for learning, we present an approach that can learn without such annotations. Our key insight is that these tasks are geometrically related, and we can obtain supervisory signal via enforcing consistency among the predictions. We present results across a diverse set of animate object categories, showing that our method can learn articulation and CSM prediction from image collections using only foreground mask labels for training. We empirically show that allowing articulation helps learn more accurate CSM prediction, and that enforcing the consistency with predicted CSM is similarly critical for learning meaningful articulation.

@inProceedings{kulkarni2020acsm,
title={Articulation-aware Canonical Surface Mapping},
author={Kulkarni, Nilesh and Gupta, Abhinav and Fouhey, David and Tulsiani, Shubham},
year={2020},
booktitle={Computer Vision and Pattern Recognition (CVPR)}
}

Use the Force, Luke! Learning to Predict Physical Forces by Simulating Effects
Kiana Ehsani, Shubham Tulsiani, Saurabh Gupta, Ali Farhadi, Abhinav Gupta
CVPR, 2020
pdf   project page   abstract   bibtex   code

When we humans look at a video of human-object interaction, we can not only infer what is happening but we can even extract actionable information and imitate those interactions. On the other hand, current recognition or geometric approaches lack the physicality of action representation. In this paper, we take a step towards more physical understanding of actions. We address the problem of inferring contact points and the physical forces from videos of humans interacting with objects. One of the main challenges in tackling this problem is obtaining ground-truth labels for forces. We sidestep this problem by instead using a physics simulator for supervision. Specifically, we use a simulator to predict effects, and enforce that estimated forces must lead to same effect as depicted in the video. Our quantitative and qualitative results show that (a) we can predict meaningful forces from videos whose effects lead to accurate imitation of the motions observed, (b) by jointly optimizing for contact point and force prediction, we can improve the performance on both tasks in comparison to independent training, and (c) we can learn a representation from this model that generalizes to novel objects using few shot examples.

@inProceedings{ehsani2020force,
title={Use the Force, Luke! Learning to Predict Physical Forces by Simulating Effects},
author={Ehsani, Kiana and  Tulsiani, Shubham and Gupta, Saurabh and Farhadi, Ali and Gupta, Abhinav},
year={2020},
booktitle={Computer Vision and Pattern Recognition (CVPR)}
}

Intrinsic Motivation for Encouraging Synergistic Behavior
Rohan Chitnis, Shubham Tulsiani, Saurabh Gupta, Abhinav Gupta
ICLR, 2020
pdf   project page   abstract   bibtex

We study the role of intrinsic motivation as an exploration bias for reinforcement learning in sparse-reward synergistic tasks, which are tasks where multiple agents must work together to achieve a goal they could not individually. Our key idea is that a good guiding principle for intrinsic motivation in synergistic tasks is to take actions which affect the world in ways that would not be achieved if the agents were acting on their own. Thus, we propose to incentivize agents to take (joint) actions whose effects cannot be predicted via a composition of the predicted effect for each individual agent. We study two instantiations of this idea, one based on the true states encountered, and another based on a dynamics model trained concurrently with the policy. While the former is simpler, the latter has the benefit of being analytically differentiable with respect to the action taken. We validate our approach in robotic bimanual manipulation tasks with sparse rewards; we find that our approach yields more efficient learning than both 1) training with only the sparse reward and 2) using the typical surprise-based formulation of intrinsic motivation, which does not bias toward synergistic behavior.

@inproceedings{chitnis20synergy,
title={Intrinsic Motivation for Encouraging Synergistic Behavior},
author={Chitnis, Rohan and Tulsiani, Shubham and Gupta, Saurabh and Gupta, Abhinav},
booktitle={ICLR},
year={2020}
}

Discovering Motor Programs by Recomposing Demonstrations
Tanmay Shankar, Shubham Tulsiani, Lerrel Pinto, Abhinav Gupta
ICLR, 2020
pdf   abstract   bibtex

We learn a space of motor primitives from unannotated robot demonstrations, and show these primitives are semantically meaningful and can be composed for new robot tasks. Abstract: In this paper, we present an approach to learn recomposable motor primitives across large-scale and diverse manipulation demonstrations. Current approaches to decomposing demonstrations into primitives often assume manually defined primitives and bypass the difficulty of discovering these primitives. On the other hand, approaches in primitive discovery put restrictive assumptions on the complexity of a primitive, which limit applicability to narrow tasks. Our approach attempts to circumvent these challenges by jointly learning both the underlying motor primitives and recomposing these primitives to form the original demonstration. Through constraints on both the parsimony of primitive decomposition and the simplicity of a given primitive, we are able to learn a diverse set of motor primitives, as well as a coherent latent representation for these primitives. We demonstrate, both qualitatively and quantitatively, that our learned primitives capture semantically meaningful aspects of a demonstration. This allows us to compose these primitives in a hierarchical reinforcement learning setup to efficiently solve robotic manipulation tasks like reaching and pushing.

@inproceedings{shankar20motor,
title={Discovering Motor Programs by Recomposing Demonstrations},
author={Shankar, Tanmay and Tulsiani, Shubham and Pinto, Lerrel and Gupta, Abhinav},
booktitle={ICLR},
year={2020}
}

Efficient Bimanual Manipulation using Learned Task Schemas
Rohan Chitnis, Shubham Tulsiani, Saurabh Gupta, Abhinav Gupta
ICRA, 2020
preprint   abstract   bibtex   video

We address the problem of effectively composing skills to solve sparse-reward tasks in the real world. Given a set of parameterized skills (such as exerting a force or doing a top grasp at a location), our goal is to learn policies that invoke these skills to efficiently solve such tasks. Our insight is that for many tasks, the learning process can be decomposed into learning a state-independent task schema (a sequence of skills to execute) and a policy to choose the parameterizations of the skills in a state-dependent manner. For such tasks, we show that explicitly modeling the schema's state-independence can yield significant improvements in sample efficiency for model-free reinforcement learning algorithms. Furthermore, these schemas can be transferred to solve related tasks, by simply re-learning the parameterizations with which the skills are invoked. We find that doing so enables learning to solve sparse-reward tasks on real-world robotic systems very efficiently. We validate our approach experimentally over a suite of robotic bimanual manipulation tasks, both in simulation and on real hardware.

@inproceedings{chitnis20schema,
title={Efficient Bimanual Manipulation Using Learned Task Schemas},
author={Chitnis, Rohan and Tulsiani, Shubham and Gupta, Saurabh and Gupta, Abhinav},
booktitle={ICRA},
year={2020}
}

2019

Object-centric Forward Modeling for Model Predictive Control
Yufei Ye, Dhiraj Gandhi, Abhinav Gupta, Shubham Tulsiani
CORL, 2019
pdf   project page   abstract   bibtex

We present an approach to learn an object-centric forward model, and show that this allows us to plan for sequences of actions to achieve distant desired goals. We propose to model a scene as a collection of objects, each with an explicit spatial location and implicit visual feature, and learn to model the effects of actions using random interaction data. Our model allows capturing the robot-object and object-object interactions, and leads to more sample-efficient and accurate predictions. We show that this learned model can be leveraged to search for action sequences that lead to desired goal configurations, and that in conjunction with a learned correction module, this allows for robust closed loop execution. We present experiments both in simulation and the real world, and show that our approach improves over alternate implicit or pixel-space forward models.

@inProceedings{ye2019ocm,
title={Object-centric Forward Modeling for Model Predictive Control},
author={Ye, Yufei and Gandhi, Dhiraj and Gupta, Abhinav and Tulsiani, Shubham},
year={2019},
booktitle={Conference on Robot Learning (CORL)}
}

Canonical Surface Mapping via Geometric Cycle Consistency
Nilesh Kulkarni, Abhinav Gupta*, Shubham Tulsiani*
ICCV, 2019
pdf   project page   abstract   bibtex   video   code

We explore the task of Canonical Surface Mapping (CSM). Specifically, given an image, we learn to map pixels on the object to their corresponding locations on an abstract 3D model of the category. But how do we learn such a mapping? A supervised approach would require extensive manual labeling which is not scalable beyond a few hand-picked categories. Our key insight is that the CSM task (pixel to 3D), when combined with 3D projection (3D to pixel), completes a cycle. Hence, we can exploit a geometric cycle consistency loss, thereby allowing us to forgo the dense manual supervision. Our approach allows us to train a CSM model for a diverse set of classes, without sparse or dense keypoint annotation, by leveraging only foreground mask labels for training. We show that our predictions also allow us to infer dense correspondence between two images, and compare the performance of our approach against several methods that predict correspondence by leveraging varying amount of supervision.

@inProceedings{kulkarni2019csm,
title={Canonical Surface Mapping via Geometric Cycle Consistency},
author={Kulkarni, Nilesh and Gupta, Abhinav and Tulsiani, Shubham},
year={2019},
booktitle={International Conference on Computer Vision (ICCV)}
}

Compositional Video Prediction
Yufei Ye, Maneesh Singh, Abhinav Gupta*, Shubham Tulsiani*
ICCV, 2019
pdf   project page   abstract   bibtex   code

We present an approach for pixel-level future prediction given an input image of a scene. We observe that a scene is comprised of distinct entities that undergo motion and present an approach that operationalizes this insight. We implicitly predict future states of independent entities while reasoning about their interactions, and compose future video frames using these predicted states. We overcome the inherent multi-modality of the task using a global trajectory-level latent random variable, and show that this allows us to sample diverse and plausible futures. We empirically validate our approach against alternate representations and ways of incorporating multi-modality. We examine two datasets, one comprising of stacked objects that may fall, and the other containing videos of humans performing activities in a gym, and show that our approach allows realistic stochastic video prediction across these diverse settings.

@inProceedings{ye2019cvp,
title={Compositional Video Prediction},
author={Ye, Yufei and Singh, Maneesh and Gupta, Abhinav and Tulsiani, Shubham},
year={2019},
booktitle={International Conference on Computer Vision (ICCV)}
}

3D-RelNet: Joint Object and Relational Network for 3D Prediction
Nilesh Kulkarni, Ishan Misra, Shubham Tulsiani, Abhinav Gupta
ICCV, 2019
pdf   project page   abstract   bibtex   code

We propose an approach to predict the 3D shape and pose for the objects present in a scene. Existing learning based methods that pursue this goal make independent predictions per object, and do not leverage the relationships amongst them. We argue that reasoning about these relationships is crucial, and present an approach to incorporate these in a 3D prediction framework. In addition to independent per-object predictions, we predict pairwise relations in the form of relative 3D pose, and demonstrate that these can be easily incorporated to improve object level estimates. We report performance across different datasets (SUNCG, NYUv2), and show that our approach significantly improves over independent prediction approaches while also outperforming alternate implicit reasoning methods.

@inProceedings{kulkarni2019relnet,
title={3D-RelNet: Joint Object and Relational Network for 3D Prediction},
author={Nilesh Kulkarni, Ishan Misra, Shubham Tulsiani, Abhinav Gupta},
booktitle={ICCV},
year={2019}
}

Order-Aware Generative Modeling Using the 3D-Craft Dataset
Zhuoyuan Chen*, Demi Guo*, Tong Xiao*, et. al.
ICCV, 2019
pdf   abstract   bibtex

In this paper, we study the problem of sequentially building houses in the game of Minecraft, and demonstrate that learning the ordering can make for more effective autoregressive models. Given a partially built house made by a human player, our system tries to place additional blocks in a human-like manner to complete the house. We introduce a new dataset, HouseCraft, for this new task. HouseCraft contains the sequential order in which 2,500 Minecraft houses were built from scratch by humans. The human action sequences enable us to learn an order-aware generative model called Voxel-CNN. In contrast to many generative models where the sequential generation ordering either does not matter (e.g. holistic generation with GANs), or is manually/arbitrarily set by simple rules (e.g. raster-scan order), our focus is on an ordered generation that imitates humans. To evaluate if a generative model can accurately predict human-like actions, we propose several novel quantitative metrics. We demonstrate that our Voxel-CNN model is simple and effective at this creative task, and can serve as a strong baseline for future research in this direction.

@inProceedings{chen2019craft,
title={Order-Aware Generative Modeling Using the 3D-Craft Dataset},
author={Zhuoyuan Chen, Demi Guo, Tong Xiao, Saining Xie, Xinlei Chen, Haonan Yu, Jonathan Gray, Kavya Srinet, Haoqi Fan, Jerry Ma, Charles R. Qi, Shubham Tulsiani, Arthur Szlam, and C. Lawrence Zitnick},
booktitle={ICCV},
year={2019}
}

Learning Unsupervised Multi-View Stereopsis via Robust Photometric Consistency
Tejas Khot*, Shubham Agrawal*, Shubham Tulsiani, Christoph Mertz, Simon Lucey, Martial Hebert
arXiv preprint, 2019
pdf   project page   abstract   bibtex   code

We present a learning based approach for multi-view stereopsis (MVS). While current deep MVS methods achieve impressive results, they crucially rely on ground-truth 3D training data, and acquisition of such precise 3D geometry for supervision is a major hurdle. Our framework instead leverages photometric consistency between multiple views as supervisory signal for learning depth prediction in a wide baseline MVS setup. However, naively applying photo consistency constraints is undesirable due to occlusion and lighting changes across views. To overcome this, we propose a robust loss formulation that: a) enforces first order consistency and b) for each point, selectively enforces consistency with some views, thus implicitly handling occlusions. We demonstrate our ability to learn MVS without 3D supervision using a real dataset, and show that each component of our proposed robust loss results in a significant improvement. We qualitatively observe that our reconstructions are often more complete than the acquired ground truth, further showing the merits of this approach. Lastly, our learned model generalizes to novel settings, and our approach allows adaptation of existing CNNs to datasets without ground-truth 3D by unsupervised finetuning.

@article{khot2019learning,
title={Learning Unsupervised Multi-View Stereopsis via Robust Photometric Consistency},
author={Khot, Tejas and Agrawal, Shubham and Tulsiani, Shubham and Mertz, Christoph and Lucey, Simon and Hebert, Martial},
journal={arXiv preprint arXiv:1905.02706},
year={2019}
}

2018

Layer-structured 3D Scene Inference via View Synthesis
Shubham Tulsiani, Richard Tucker, Noah Snavely
ECCV, 2018
pdf   project page   abstract   bibtex   code

We present an approach to infer a layer-structured 3D representation of a scene from a single input image. This allows us to infer not only the depth of the visible pixels, but also capture the texture and depth for content in the scene that is not directly visible. We overcome the challenge posed by the lack of direct supervision by instead leveraging a more naturally available multi-view supervisory signal. Our insight is to use view synthesis as a proxy task: we enforce that our representation (inferred via a single image), when rendered from a novel perspective, matches the true observation. We present a learning framework that operationalizes this insight using a new, differentiable novel view renderer. We provide qualitative and quantitative validation of our approach in two different settings, and demonstrate that we can learn to capture the hidden aspects of a scene.

@inProceedings{lsiTulsiani18,
title={Layer-structured 3D Scene Inference via View Synthesis},
author = {Shubham Tulsiani
and Richard Tucker
and Noah Snavely},
booktitle={ECCV},
year={2018}
}

Learning Category-Specific Mesh Reconstruction from Image Collections
Angjoo Kanazawa*, Shubham Tulsiani*, Alexei A. Efros, Jitendra Malik
ECCV, 2018
pdf   project page   abstract   bibtex   video   code

We present a learning framework for recovering the 3D shape, camera, and texture of an object from a single image. The shape is represented as a deformable 3D mesh model of an object category where a shape is parameterized by a learned mean shape and per-instance predicted deformation. Our approach allows leveraging an annotated image collection for training, where the deformable model and the 3D prediction mechanism are learned without relying on ground-truth 3D or multi-view supervision. Our representation enables us to go beyond existing 3D prediction approaches by incorporating texture inference as prediction of an image in a canonical appearance space. Additionally, we show that semantic keypoints can be easily associated with the predicted shapes. We present qualitative and quantitative results of our approach on the CUB dataset, and show that we can learn to predict the diverse shapes and textures across birds using only an annotated image collection. We also demonstrate the the applicability of our method for learning the 3D structure of other generic categories.

@inProceedings{cmrKanazawa18,
title={Learning Category-Specific Mesh Reconstruction
from Image Collections},
author = {Angjoo Kanazawa and
Shubham Tulsiani
and Alexei A. Efros
and Jitendra Malik},
booktitle={ECCV},
year={2018}
}

Multi-view Consistency as Supervisory Signal for Learning Shape and Pose Prediction
Shubham Tulsiani, Alexei A. Efros, Jitendra Malik
CVPR, 2018
pdf   project page   abstract   bibtex   code

We present a framework for learning single-view shape and pose prediction without using direct supervision for either. Our approach allows leveraging multi-view observations from unknown poses as supervisory signal during training. Our proposed training setup enforces geometric consistency between the independently predicted shape and pose from two views of the same instance. We consequently learn to predict shape in an emergent canonical (view-agnostic) frame along with a corresponding pose predictor. We show empirical and qualitative results using the ShapeNet dataset and observe encouragingly competitive performance to previous techniques which rely on stronger forms of supervision. We also demonstrate the applicability of our framework in a realistic setting which is beyond the scope of existing techniques: using a training dataset comprised of online product images where the underlying shape and pose are unknown.

@inProceedings{mvcTulsiani18,
title={Multi-view Consistency as Supervisory Signal
for Learning Shape and Pose Prediction},
author = {Shubham Tulsiani
and Alexei A. Efros
and Jitendra Malik},
booktitle={Computer Vision and Pattern Recognition (CVPR)},
year={2018}
}

Factoring Shape, Pose, and Layout from the 2D Image of a 3D Scene
Shubham Tulsiani, Saurabh Gupta, David Fouhey, Alexei A. Efros, Jitendra Malik
CVPR, 2018
pdf   project page   abstract   bibtex   code

The goal of this work is to take a single 2D image of a scene and recover the 3D structure in terms of a small set of factors: a layout representing the enclosing surfaces as well as a set of objects represented in terms of shape and pose. We propose a convolutional neural network-based approach to predict this representation and benchmark it on a large dataset of indoor scenes. Our experiments evaluate a number of practical design questions, demonstrate that we can infer this representation, and quantitatively and qualitatively demonstrate its merits compared to alternate representations.

@inProceedings{factored3dTulsiani17,
title={Factoring Shape, Pose, and Layout from the 2D Image of a 3D Scene},
author = {Shubham Tulsiani
and Saurabh Gupta
and David Fouhey
and Alexei A. Efros
and Jitendra Malik},
booktitle={Computer Vision and Pattern Recognition (CVPR)},
year={2018}
}

2017

Multi-view Supervision for Single-view Reconstruction via Differentiable Ray Consistency
Shubham Tulsiani, Tinghui Zhou, Alexei A. Efros, Jitendra Malik
CVPR, 2017
pdf   project page   abstract   bibtex   slides   talk   code   blog post

We study the notion of consistency between a 3D shape and a 2D observation and propose a differentiable formulation which allows computing gradients of the 3D shape given an observation from an arbitrary view. We do so by reformulating view consistency using a differentiable ray consistency (DRC) term. We show that this formulation can be incorporated in a learning framework to leverage different types of multi-view observations e.g. foreground masks, depth, color images, semantics etc. as supervision for learning single-view 3D prediction. We present empirical analysis of our technique in a controlled setting. We also show that this approach allows us to improve over existing techniques for single-view reconstruction of objects from the PASCAL VOC dataset.

@inProceedings{drcTulsiani17,
title={Multi-view Supervision for Single-view Reconstruction
via Differentiable Ray Consistency},
author = {Shubham Tulsiani
and Tinghui Zhou
and Alexei A. Efros
and Jitendra Malik},
booktitle={Computer Vision and Pattern Recognition (CVPR)},
year={2017}
}

Learning Shape Abstractions by Assembling Volumetric Primitives
Shubham Tulsiani, Hao Su, Leonidas J. Guibas, Alexei A. Efros, Jitendra Malik
CVPR, 2017
pdf   project page   abstract   bibtex   code (torch)   code (pytorch - unofficial)

We present a learning framework for abstracting complex shapes by learning to assemble objects using 3D volumetric primitives. In addition to generating simple and geometrically interpretable explanations of 3D objects, our framework also allows us to automatically discover and exploit consistent structure in the data. We demonstrate that using our method allows predicting shape representations which can be leveraged for obtaining a consistent parsing across the instances of a shape collection and constructing an interpretable shape similarity measure. We also examine applications for image-based prediction as well as shape manipulation.

@inProceedings{abstractionTulsiani17,
title={Learning Shape Abstractions by Assembling Volumetric Primitives},
author = {Shubham Tulsiani
and Hao Su
and Leonidas J. Guibas
and Alexei A. Efros
and Jitendra Malik},
booktitle={Computer Vision and Pattern Recognition (CVPR)},
year={2017}
}

Hierarchical Surface Prediction for 3D Object Reconstruction
Christian Häne, Shubham Tulsiani, Jitendra Malik
3DV, 2017
pdf   abstract   bibtex   slides   code

Recently, Convolutional Neural Networks have shown promising results for 3D geometry prediction. They can make predictions from very little input data such as for example a single color image, depth map or a partial 3D volume. A major limitation of such approaches is that they only predict a coarse resolution voxel grid, which does not capture the surface of the objects well. We propose a general framework, called hierarchical surface prediction (HSP), which facilitates prediction of high resolution voxel grids. The main insight is that it is sufficient to predict high resolution voxels around the predicted surfaces. The exterior and interior of the objects can be represented with coarse resolution voxels. This allows us to predict significantly higher resolution voxel grids around the surface, from which triangle meshes can be extracted. Our approach is general and not dependent on a specific input type. In our experiments we show results for geometry prediction from color images, depth images and shape completion from partial voxel grids. Our analysis shows that the network is able to predict the surface more accurately than a low resolution prediction.

@incollection{hspHane17,
author = {Christian H{\"a}ne and
Shubham Tulsiani and
Jitendra Malik},
title = {Hierarchical Surface Prediction for 3D Object Reconstruction},
booktitle = {arXiv preprint arXiv:1704.00710},
year = {2017}
}

2016

Learning Category-Specific Deformable 3D Models for Object Reconstruction
Shubham Tulsiani*, Abhishek Kar*, João Carreira, Jitendra Malik
TPAMI, 2016
pdf   abstract   bibtex   code

We address the problem of fully automatic object localization and reconstruction from a single image. This is both a very challenging and very important problem which has, until recently, received limited attention due to difficulties in segmenting objects and predicting their poses. Here we leverage recent advances in learning convolutional networks for object detection and segmentation and introduce a complementary network for the task of camera viewpoint prediction. These predictors are very powerful, but still not perfect given the stringent requirements of shape reconstruction. Our main contribution is a new class of deformable 3D models that can be robustly fitted to images based on noisy pose and silhouette estimates computed upstream and that can be learned directly from 2D annotations available in object detection datasets. Our models capture top-down information about the main global modes of shape variation within a class providing a low-frequency'' shape. In order to capture fine instance-specific shape details, we fuse it with a high-frequency component recovered from shading cues. A comprehensive quantitative analysis and ablation study on the PASCAL 3D+ dataset validates the approach as we show fully automatic reconstructions on PASCAL VOC as well as large improvements on the task of viewpoint prediction.

@article{pamishapeTulsianiKCM15,
author = {Shubham Tulsiani and
Abhishek Kar and
Jo{\~{a}}o Carreira and
Jitendra Malik},
title = {Learning Category-Specific Deformable 3D
Models for Object Reconstruction},
journal = {TPAMI},
year = {2016},
}

View Synthesis by Appearance Flow
Tinghui Zhou, Shubham Tulsiani, Weilun Sun, Jitendra Malik, Alexei A. Efros
ECCV, 2016
pdf   abstract   bibtex   code

Given one or more images of an object (or a scene), is it possible to synthesize a new image of the same instance observed from an arbitrary viewpoint? In this paper, we attempt to tackle this problem, known as novel view synthesis, by re-formulating it as a pixel copying task that avoids the notorious difficulties of generating pixels from scratch. Our approach is built on the observation that the visual appearance of different views of the same instance is highly correlated. Such correlation could be explicitly learned by training a convolutional neural network (CNN) to predict \emph{appearance flows} -- 2-D coordinate vectors specifying which pixels in the input view could be used to reconstruct the target view. We show that for both objects and scenes, our approach is able to generate higher-quality synthesized views with crisp texture and boundaries than previous CNN-based techniques.

@incollection{appFlowZhou16,
author = {Tinghui Zhou and
Shubham Tulsiani and
Weilun Sun and
Jitendra Malik and
Alexei A. Efros},
title = {View Synthesis by Appearance Flow},
booktitle = {ECCV},
year = {2016}
}

2015

Pose Induction for Novel Object Categories
Shubham Tulsiani, João Carreira, Jitendra Malik
ICCV, 2015
pdf   abstract   bibtex   code

We address the task of predicting pose for objects of unannotated object categories from a small seed set of annotated object classes. We present a generalized classifier that can reliably induce pose given a single instance of a novel category. In case of availability of a large collection of novel instances, our approach then jointly reasons over all instances to improve the initial estimates. We empirically validate the various components of our algorithm and quantitatively show that our method produces reliable pose estimates. We also show qualitative results on a diverse set of classes and further demonstrate the applicability of our system for learning shape models of novel object classes.

@inProceedings{poseInductionTCM15,
author    = {Shubham Tulsiani and
Jo{\~{a}}o Carreira and
Jitendra Malik},
title     = {Pose Induction for Novel Object Categories},
year={2015},
booktitle={International Conference on Computer Vision (ICCV)}
}

Amodal Completion and Size Constancy in Natural Scenes
Abhishek Kar, Shubham Tulsiani, João Carreira, Jitendra Malik
ICCV, 2015
pdf   abstract   bibtex

We consider the problem of enriching current object detection systems with veridical object sizes and relative depth estimates from a single image. There are several technical challenges to this, such as occlusions, lack of calibration data and the scale ambiguity between object size and distance. These have not been addressed in full generality in previous work. Here we propose to tackle these issues by building upon advances in object recognition and using recently created large-scale datasets. We first introduce the task of amodal bounding box completion, which aims to infer the the full extent of the object instances in the image. We then propose a probabilistic framework for learning category-specific object size distributions from available annotations and leverage these in conjunction with amodal completions to infer veridical sizes of objects in novel images. Finally, we introduce a focal length prediction approach that exploits scene recognition to overcome inherent scale ambiguities and demonstrate qualitative results on challenging real-world scenes.

@inProceedings{amodalKTCM15,
author    = {Abhishek Kar and
Shubham Tulsiani and
Jo{\~{a}}o Carreira and
Jitendra Malik},
title     = {Amodal Completion and Size Constancy in Natural Scenes},
year={2015},
booktitle={International Conference on Computer Vision (ICCV)}
}

Viewpoints and Keypoints
Shubham Tulsiani, Jitendra Malik
CVPR, 2015
pdf   abstract   bibtex   code

We characterize the problem of pose estimation for rigid objects in terms of determining viewpoint to explain coarse pose and keypoint prediction to capture the finer details. We address both these tasks in two different settings - the constrained setting with known bounding boxes and the more challenging detection setting where the aim is to simultaneously detect and correctly estimate pose of objects. We present Convolutional Neural Network based architectures for these and demonstrate that leveraging viewpoint estimates can substantially improve local appearance based keypoint predictions. In addition to achieving significant improvements over state-of-the-art in the above tasks, we analyze the error modes and effect of object characteristics on performance to guide future efforts towards this goal.

@inProceedings{vpsKpsTulsianiM15,
author    = {Shubham Tulsiani and Jitendra Malik},
title     = {Viewpoints and Keypoints},
year={2015},
booktitle={Computer Vision and Pattern Recognition (CVPR)}
}

Category-Specific Object Reconstruction from a Single Image
Abhishek Kar*, Shubham Tulsiani*, João Carreira, Jitendra Malik
CVPR, 2015 (Best Student Paper Award)
pdf   project page   abstract   bibtex   code   note

Object reconstruction from a single image - in the wild - is a problem where we can make progress and get meaningful results today. This is the main message of this paper, which introduces an automated pipeline with pixels as inputs and 3D surfaces of various rigid categories as outputs in images of realistic scenes. At the core of our approach are deformable 3D models that can be learned from 2D annotations available in existing object detection datasets, that can be driven by noisy automatic object segmentations and which we complement with a bottom-up module for recovering high-frequency shape details. We perform a comprehensive quantitative analysis and ablation study of our approach using the recently introduced PASCAL 3D+ dataset and show very encouraging automatic reconstructions on PASCAL VOC.

@inProceedings{shapesKarTCM15,
author    = {Abhishek Kar and
Shubham Tulsiani and
Jo{\~{a}}o Carreira and
Jitendra Malik},
title     = {Category-Specific Object Reconstruction from a Single Image},
year={2015},
booktitle={Computer Vision and Pattern Recognition (CVPR)}
}

Virtual View Networks for Object Reconstruction
João Carreira, Abhishek Kar, Shubham Tulsiani, Jitendra Malik
CVPR, 2015
pdf   abstract   bibtex   video   code

All that structure from motion algorithms "see" are sets of 2D points. We show that these impoverished views of the world can be faked for the purpose of reconstructing objects in challenging settings, such as from a single image, or from a few ones far apart, by recognizing the object and getting help from a collection of images of other objects from the same class. We synthesize virtual views by computing geodesics on novel networks connecting objects with similar viewpoints, and introduce techniques to increase the specificity and robustness of factorization-based object reconstruction in this setting. We report accurate object shape reconstruction from a single image on challenging PASCAL VOC data, which suggests that the current domain of applications of rigid structure-from-motion techniques may be significantly extended.

@inProceedings{vvnCarreiraKTM15,
author    = {Jo{\~{a}}o Carreira and
Abhishek Kar and
Shubham Tulsiani and
Jitendra Malik},
title     = {Virtual View Networks for Object Reconstruction},
year={2015},
booktitle={Computer Vision and Pattern Recognition (CVPR)}
}

2013

A colorful approach to text processing by example
Kuat Yessenov, Shubham Tulsiani, Aditya Menon, Robert C Miller, Sumit Gulwani, Butler Lampson, Adam Kalai
UIST, 2013
pdf   abstract   bibtex

Text processing, tedious and error-prone even for programmers, remains one of the most alluring targets of Programming by Example. An examination of real-world text processing tasks found on help forums reveals that many such tasks, beyond simple string manipulation, involve latent hierarchical structures. We present STEPS, a programming system for processing structured and semi-structured text by example. STEPS users create and manipulate hierarchical structure by example. In a between-subject user study on fourteen computer scientists, STEPS compares favorably to traditional programming.

@inproceedings{yessenov2013colorful,
title={A colorful approach to text processing by example},
author={Yessenov, Kuat and
Tulsiani, Shubham and
}