Learning to Predict Scene Level
Implicit 3D from Posed RGBD data


Nilesh Kulkarni
Linyi Jin
Justin Johnson
David F. Fouhey

University of Michigan, Ann Arbor

CVPR 2023

[pdf]
[code]
[video]

TL;DR: Our D2-DRDF model predicts a 3D implicit function from a single input image. Unlike other methods, D2-DRDF does not depend on mesh supervision during training and can operate directly on raw posed RGBD data obtained from scene captures.

Sample results on previously unseen images from the Matterport3D (top) and OmniData (bottom) datasets. D2-DRDF reconstructs hidden sections of the floor and the rear of the couch.


Abstract

We introduce a method that can learn to predict scene-level implicit functions for 3D reconstruction from posed RGBD data. At test time, our system maps a previously unseen RGB image to a 3D reconstruction of a scene via implicit functions. While implicit functions for 3D reconstruction have often been tied to meshes, we show that we can train one using only a set of posed RGBD images. This setting may help 3D reconstruction unlock the sea of accelerometer+RGBD data that is coming with new phones. Our system, D2-DRDF, can match and sometimes outperform current methods that use mesh supervision and shows better robustness to sparse data.

Overview

In this paper, we propose a method for reconstructing 3D scenes, including occluded regions, from previously unseen RGB images. We train our approach on posed RGB and depth data. Our model represents the 3D scene using the Directed Ray Distance Function (DRDF), stated formally after the list below. In this work, we demonstrate how to geometrically supervise the DRDF by leveraging partial observations obtained from auxiliary views. We summarize our key insights as follows:
  • Auxiliary views observe parts of the ray that are occluded in the reference view. These free space observations along the ray provide independent information for different segments of the ray.
  • Combining segment-level information across auxiliary views for a ray aids in supervising the distinct segments along the ray. This is accomplished through equality and inequality constraints on the predicted DRDF values.
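For reference, here is a minimal statement of the DRDF consistent with the property used in the demo section below (our paraphrase of the formulation; \(\mathcal{I}(r)\) denotes the depths at which ray \(r\) intersects the scene):
\[
\mathrm{DRDF}(z) \;=\; z^{*} - z, \qquad z^{*} \;=\; \operatorname*{arg\,min}_{z_i \in \mathcal{I}(r)} \left| z_i - z \right|,
\]
i.e., the signed distance from \(z\) to its nearest intersection along the ray.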



Approach

We present an approach to train a model to predict 3D from single images. Our model is supervised with posed RGBD data; below, we highlight the key ideas.
Learning from Auxiliary Views. For each red ray originating from the reference camera (R), we extract depth information for points along the ray from an auxiliary view. Views (a), (b), and (c) on the right capture distinct occluded segments along the ray, providing valuable free-space information. This information enables the creation of penalty functions to train the DRDF, as sketched below.
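To make the depth test concrete, here is a minimal NumPy sketch of labeling ray samples as observed free space in one auxiliary view. This is our own illustration: `freespace_labels` and its interface are assumptions, not the released code.

```python
import numpy as np

def freespace_labels(points_world, aux_K, aux_R, aux_t, aux_depth, eps=0.02):
    """Label samples along a reference-view ray as observed free space in
    one auxiliary view (hypothetical helper sketching the idea above).

    points_world: (N, 3) 3D samples along the reference ray.
    aux_K, aux_R, aux_t: auxiliary camera intrinsics and extrinsics.
    aux_depth: (H, W) depth map of the auxiliary view.
    """
    # Move the samples into the auxiliary camera frame and project them.
    pts_cam = points_world @ aux_R.T + aux_t               # (N, 3)
    z = pts_cam[:, 2]                                      # sample depths
    uv = pts_cam @ aux_K.T
    uv = uv[:, :2] / np.maximum(uv[:, 2:3], 1e-6)          # pixel coords
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)

    # Keep samples that land inside the auxiliary image with positive depth.
    H, W = aux_depth.shape
    valid = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)

    # A sample is observed free space if it lies in front of the surface
    # the auxiliary camera sees through the same pixel.
    labels = np.zeros(len(points_world), dtype=bool)
    labels[valid] = z[valid] < aux_depth[v[valid], u[valid]] - eps
    return labels
```

Runs of consecutive True labels form the free-space segments; their endpoints become the s and e events discussed next.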
Segment Types. When the ray from the reference camera is seen by an auxiliary view, parts of it are revealed as free space. Depending on how these free-space segments start and end, they place different constraints on the DRDF. Here we show a segment that starts with a disocclusion and ends with an intersection. The space between the s and e events is unoccupied, and we convert this information into a penalty function that is used to train the model.

We show an interactive demo of these penalty plots for different segment types in the next section.
Method Overview. During training, for a given ray from the reference view, we use auxiliary views to determine the free-space segments along the ray. Then, for each 3D point on the ray, our network predicts the DRDF value and we compute the associated penalty. During inference, the network predicts the DRDF for points within the image frustum of a single image. For more information on the DRDF, see this link.
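A minimal PyTorch sketch of how such a per-ray loss might be assembled. The network interface, the simplified segment encoding, and the exact penalty shapes are our assumptions, not the paper's code.

```python
import torch

def ray_loss(net, image_feats, origin, direction, segments,
             n_samples=128, far=8.0):
    """Penalty for one reference-view ray.

    segments: list of (s, e, ends_at_intersection) free-space segments
    recovered from auxiliary views (simplified encoding).
    """
    z = torch.linspace(0.0, far, n_samples)
    pts = origin[None] + z[:, None] * direction[None]   # (n_samples, 3)
    drdf = net(image_feats, pts).squeeze(-1)            # predicted DRDF

    loss = torch.zeros(())
    for s, e, ends_at_intersection in segments:
        inside = (z >= s) & (z <= e)
        if ends_at_intersection:
            # Equality constraint: DRDF should equal the distance to the
            # intersection at the segment's end, d = e - z.
            loss = loss + ((drdf - (e - z)).abs() * inside).mean()
        else:
            # Inequality constraint: the prediction may not imply a hit
            # inside observed free space, i.e. z + d must avoid (s, e).
            hit = z + drdf
            violation = inside & (hit > s) & (hit < e)
            loss = loss + (torch.minimum(hit - s, e - hit) * violation).mean()
    return loss
```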


Interactive Demo for Segment Penalty Plots

We demonstrate how different segments along a ray impose distinct penalty functions on the predicted DRDF values, shown below. On the X-axis, we interactively select the positions of Intersection (I) and Occlusion (O) events along the ray. The placement of these events determines which DRDF values are penalized as inconsistent with the observed intersection or occlusion. We generate a heat map of the penalty associated with a DRDF value (Y-axis) at a specific point along the ray (X-axis). Dark red regions indicate high penalty magnitude, while light grey regions indicate low penalty.

The key property of the DRDF is that for any point along the ray, \(z \in [0, Z]\), if \(\mathrm{DRDF}(z) = d\), then there is an intersection at the point \(z + d\) on the ray. All the penalty plots below are implications of this equation.
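As a quick worked example of this property: if the ray has intersections at \(z = 2\) and \(z = 5\), then \(\mathrm{DRDF}(1) = 1\) (the hit at 2 lies ahead), \(\mathrm{DRDF}(3) = -1\) (the nearest hit at 2 lies behind), and \(\mathrm{DRDF}(4.5) = 0.5\); in each case \(z + \mathrm{DRDF}(z)\) lands on an intersection.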
(a)
This plot shows the penalty function for a segment between two intersection events (II). The only DRDF values that meet the criteria lie along a line of slope -1, depicted as the brighter, zero-penalty region of the plot. Shifting the slider horizontally moves this line in parallel along the X-axis. This segment, denoted II, imposes an equality constraint.
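A minimal formalization of this equality constraint (assuming an L1 penalty; the paper's exact penalty shape may differ): for a segment bounded by intersections at \(s\) and \(e\), each point inside must predict the signed distance to its nearer intersection,
\[
\ell_{\mathrm{II}}(z, d) \;=\; \bigl| d - \mathrm{DRDF}^{*}(z) \bigr|, \qquad
\mathrm{DRDF}^{*}(z) \;=\;
\begin{cases}
s - z, & |z - s| \le |z - e|,\\
e - z, & \text{otherwise},
\end{cases}
\]
whose zero set is exactly the slope \(-1\) structure visible in the plot.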
(b)
This plot shows a ray segment between two occlusion events (OO). The plot illustrates that DRDF values cannot fall within the red strip, establishing an inequality constraint that permits the DRDF to take values only outside this region. By adjusting the positions of the O events, the width of the red strip can be increased or decreased.
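One way to write the corresponding inequality penalty (a hinge into the forbidden strip; again an assumed shape): a prediction \(d\) at \(z \in [s, e]\) must satisfy \(z + d \notin (s, e)\), giving
\[
\ell_{\mathrm{OO}}(z, d) \;=\; \max\Bigl(0,\; \min\bigl(d - (s - z),\; (e - z) - d\bigr)\Bigr),
\]
which is zero outside the red strip and grows toward its center.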

(c)
This scenario encompasses events from both (a) and (b). Points located closer to the I event are limited to a single DRDF value. However, points closer to the O event must adhere to DRDF values that prevent any additional intersections within the IO segment. This segment introduces both an equality and an inequality constraint.
(d)
This scenario combines events from both (a) and (b). Points along the ray closer to the I event are constrained to a single DRDF value. On the other hand, points closer to the O event must assume DRDF values that prevent any additional intersections within the OI segment. Consequently, this segment introduces both an equality and an inequality constraint.
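To tie panels (a)-(d) together, here is a didactic NumPy sketch that generates penalty heat maps like those above. The function name, the exact penalty shapes, and the treatment of disocclusions as occlusion-like events are our simplifying assumptions, not the demo's implementation.

```python
import numpy as np

def penalty_heatmap(s, e, start_type, end_type,
                    z_max=10.0, d_max=5.0, res=256):
    """Penalty over (position z, candidate DRDF value d) for one segment.

    start_type / end_type: 'I' for an intersection event; anything else
    (occlusion, disocclusion) is treated as occlusion-like here.
    """
    z = np.linspace(0.0, z_max, res)[None, :]      # X-axis: position on ray
    d = np.linspace(-d_max, d_max, res)[:, None]   # Y-axis: DRDF value
    inside = (z >= s) & (z <= e)

    # Base inequality penalty: a prediction may not imply a hit strictly
    # inside the observed free space, i.e. z + d must avoid (s, e).
    hit = z + d
    strip = inside & (hit > s) & (hit < e)
    pen = np.where(strip, np.minimum(hit - s, e - hit), 0.0)

    # Intersection endpoints pin nearby points to an exact DRDF value
    # (the slope -1 lines of panel (a)).
    if start_type == 'I':
        near_s = inside & (np.abs(z - s) <= np.abs(z - e))
        pen = np.where(near_s, np.abs(d - (s - z)), pen)
    if end_type == 'I':
        near_e = inside & (np.abs(z - e) < np.abs(z - s))
        pen = np.where(near_e, np.abs(d - (e - z)), pen)
    return pen  # (res, res); dark red = high values, light grey = near zero

# Example: the II segment of panel (a).
heat_II = penalty_heatmap(2.0, 6.0, 'I', 'I')
```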

Overview Video



Paper

Kulkarni, Jin, Johnson, Fouhey

Learning to Predict Scene Level Implicit 3D from Posed RGBD data

CVPR, 2023

[pdf]
[bibtex]


Code

Coming soon...

 [GitHub]


Acknowledgements

We thank our colleagues Richard Higgins, Sarah Jabour, Dandan Shan, Karan Desai, Mohamed El Banani, and Chris Rockwell for feedback and wonderful discussions on the project. We acknowledge Shengyi Qian's help with the ViewSeg code used to implement the PixelNeRF baseline. Toyota Research Institute (“TRI”) provided funds to assist the authors with their research, but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity. NK was supported by TRI. This base version of the template is borrowed from colorful folks.