
StereoMFC (stereo multi-flash camera) prototype

For our prototype, we use a pair of machine vision cameras (FLIR Grasshopper) with a 1'', 4 MP CMOS imaging sensor with a resolution of 2048 \(\times\) 2048 pixels. As we focus mainly on small scenes, we use two sets of lenses that yield a narrow field of view -- 12 mm and 16 mm fixed focal length lenses (Edmund Optics). We use 80 W, 5600 K white LEDs (CREE LED) as flashes, driven by a DC power supply and switched through MOSFETs controlled by an Arduino over USB. At each pose of our rig, we capture 12 images per camera with one flash on at a time, plus one HDR image per camera. The cameras are configured to return a 12-bit Bayer image, which is then de-Bayered to yield a 16-bit RGB image. The left and right cameras are triggered simultaneously by an external synchronization signal. The frame acquisition and flash triggering programs run on the same thread, and frame acquisition is synchronized with the flashes through blocking function calls.
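For concreteness, a minimal sketch of the per-pose capture loop is shown below. The StereoRig wrapper, its grab methods, and the serial protocol are hypothetical stand-ins (we assume the Arduino firmware turns flash k on when it receives byte k); the point is only that the blocking calls keep flash switching and frame acquisition on one thread, as described above.

```python
# Minimal sketch of one capture pose. `rig` (a hypothetical StereoRig wrapper
# around the camera SDK) and the one-byte serial protocol are assumptions.
import serial  # pyserial, talking to the Arduino that drives the MOSFETs

NUM_FLASHES = 12

def capture_pose(rig, port="/dev/ttyACM0"):
    """Capture 12 single-flash stereo pairs plus one HDR pair at the current pose."""
    ard = serial.Serial(port, baudrate=115200, timeout=1.0)
    flash_frames = []
    for k in range(NUM_FLASHES):
        ard.write(bytes([k]))           # turn flash k on (all others off)
        ard.read(1)                     # block until the Arduino acknowledges
        left, right = rig.grab_pair()   # externally triggered, blocking grab
        flash_frames.append((k, left, right))
    ard.write(bytes([255]))             # all flashes off (assumed convention)
    hdr_left, hdr_right = rig.grab_hdr_pair()  # bracketed exposures, merged on the host
    ard.close()
    return flash_frames, (hdr_left, hdr_right)
```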


Demonstration of our pipeline

RGB and depth (Left)

RGB and depth (Right)

Normals and edges (Left)

Normals and edges (Right)

Mesh with diffuse texture

Left and right flash images

Mesh with truncated BRDF texture

Our system captures the scene using the StereoMFC and yields color images (left and right), aligned depth maps (left and right), normals, and depth edges. We use these to model the scene's radiance and geometry. After the scene's diffuse color has been captured, we use the multi-flash images and a truncated set of the Disney BRDF parameters to model the illumination-dependent aspects of the scene.
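Schematically, each rig pose contributes the products listed below, and the final model stores the geometry plus two appearance layers. The field names are illustrative placeholders for exposition, not our actual code.

```python
# Illustrative data layout for one rig pose and for the final scene model.
from dataclasses import dataclass
import numpy as np

@dataclass
class PoseCapture:
    rgb_left: np.ndarray      # HDR color, left camera
    rgb_right: np.ndarray     # HDR color, right camera
    depth_left: np.ndarray    # aligned metric depth, left
    depth_right: np.ndarray   # aligned metric depth, right
    normals: np.ndarray       # per-pixel normals from the multi-flash images
    depth_edges: np.ndarray   # depth-edge map from the multi-flash images
    flash_images: np.ndarray  # the 12 single-flash exposures per camera

@dataclass
class SceneModel:
    mesh: object                 # geometry extracted from the neural representation
    diffuse_texture: np.ndarray  # baked base color
    brdf_texture: np.ndarray     # truncated Disney parameters (specular, roughness, r0) per texel
```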

Similarities with AdaptiveShells [Wang et al. SIGGRAPH Asia 2023]

GT (left) vs. reconstruction (right)

Crops

Shells

Processing geometry with thin details

RGB & depth (Left, geometry 1)

Normals & edges (Left, geometry 1)

RGB & depth (Right, geometry 2)

Normals & edges (Right, geometry 2)

Although a volumetric representation is more photorealistic for capturing thin geometries, we present a mesh-based representation and the shells extracted by our method. The examples below were trained with 4 stereo pairs: 7 images for training, 1 for testing, a total training time of ~55 mins, 35K gradient steps, and a test PSNR of 29+. However, as we bake the appearance into a diffuse texture, the PSNR drops to about 25 in both cases.
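The PSNR numbers above are standard peak signal-to-noise ratios computed on the held-out view; a minimal helper, assuming float images in [0, 1]:

```python
import numpy as np

def psnr(img, ref, max_val=1.0):
    """Peak signal-to-noise ratio in dB for float images in [0, max_val]."""
    mse = np.mean((np.asarray(img, np.float64) - np.asarray(ref, np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)
```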

Geometry 1 : Textured mesh

Geometry 1 : Shell

Geometry 2 : Textured mesh

Geometry 2 : Shell

Quality of reconstruction with a volumetric renderer

The views are rendered at 256x256 pixels; for higher-resolution images, please check out our paper draft! Training was stopped at a test PSNR of 27.5+ or 100K gradient steps, whichever came first.

VolSDF\(^{†}\)

NeuS\(^{†}\)

AdaShell\(^{†}\)

UniSurf\(^{†}\)

Reconstructing a very shiny object

The reconstruction is created from 7 RGBD images drawn from 4 stereo pairs (one RGBD image reserved for validation). Time taken to refine geometry and base color: ~40 mins (100K gradient steps). Time to recover the partial BRDF model: ~4 hrs. We recover specularity, roughness and \(r_0\), the Fresnel reflection component. The web-based display tool used here ignores \(r_0\).
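For readers who want the shape of the appearance model: below is a generic, isotropic sketch of a diffuse-plus-GGX lobe with a Schlick Fresnel term, parameterized by the quantities we recover (a specular scale, roughness, and \(r_0\)). The exact truncation of the Disney model and the role of each parameter follow the paper; this is only an illustration.

```python
import numpy as np

def eval_brdf(albedo, specular, roughness, r0, n, l, v):
    """Lambertian diffuse term plus an isotropic GGX lobe with Schlick Fresnel.
    n, l, v are unit vectors (normal, light, view). Returns the BRDF value f_r;
    the renderer multiplies by max(n.l, 0) and the incident radiance."""
    h = l + v
    h = h / (np.linalg.norm(h) + 1e-8)
    nl, nv = max(float(n @ l), 0.0), max(float(n @ v), 0.0)
    nh, hv = max(float(n @ h), 0.0), max(float(h @ v), 0.0)
    if nl == 0.0 or nv == 0.0:
        return np.zeros(3)

    a = roughness ** 2                                                     # Disney roughness remapping
    d = a ** 2 / (np.pi * ((nh ** 2) * (a ** 2 - 1.0) + 1.0) ** 2 + 1e-8)  # GGX normal distribution
    f = r0 + (1.0 - r0) * (1.0 - hv) ** 5                                  # Schlick Fresnel with F0 = r0
    k = a / 2.0                                                            # one common Smith-Schlick choice
    g = (nl / (nl * (1.0 - k) + k)) * (nv / (nv * (1.0 - k) + k))          # shadowing-masking
    spec = specular * d * f * g / (4.0 * nl * nv + 1e-8)                   # specular scale on the lobe
    return np.asarray(albedo) / np.pi + spec
```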

Captured RGB and Depth

Multi-flash (12 x 4 x 2) (Left & Right)

Mesh with truncated BRDF texture

Relighting faces with a single stereo pair

The reconstruction is created from 1 stereo pair -- 2 RGBD images. Time taken to refine geometry and base color: ~20 mins (100K gradient steps). Time to recover the partial BRDF model: ~4 hrs. We recover specularity, roughness and \(r_0\), the Fresnel reflection component. The movie is rendered approximately from the viewpoint of a camera placed between the stereo cameras. We use Blender's keyframing tool to manually choose 20 light locations: 10 on the plane of the camera and 10 around the object. We then render 200 relit images using Blender. Blender does not support our exact reflectance model, so, as before, we drop \(r_0\), the Fresnel reflection component. We use Blender's Cycles renderer at 2048 SPP and 1536x1536 pixels, and rescale the output to 370x370 pixels below.
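The keyframing and batch rendering were scripted with Blender's Python API (bpy); a rough sketch is below. The light name, the keyframe spacing, and the 'light_positions.npy' file are placeholders -- the 20 positions themselves were picked interactively in Blender's UI.

```python
# Rough sketch of the Blender keyframing and batch render (Blender Python API).
import bpy
import numpy as np

scene = bpy.context.scene
scene.render.engine = 'CYCLES'
scene.cycles.samples = 2048               # SPP used for the final renders
scene.render.resolution_x = 1536
scene.render.resolution_y = 1536

light = bpy.data.objects['KeyLight']      # the single animated light (placeholder name)

positions = np.load('light_positions.npy')  # (20, 3): 10 on the camera plane, 10 around the object
for i, pos in enumerate(positions):
    light.location = tuple(pos)
    # spread the 20 keys over the 200-frame animation; Blender interpolates in between
    light.keyframe_insert(data_path='location', frame=round(1 + i * 199 / (len(positions) - 1)))

scene.frame_start, scene.frame_end = 1, 200
scene.render.filepath = '//relit_'        # frames are written as relit_0001.png, ...
bpy.ops.render.render(animation=True)
```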

Multi-flash pair (12 x 2) (L & R)

Normals and edges (or specularities) (L & R)

200 new light positions @ 1FPS

Accuracy of reconstructing synthetic scenes.

For un-posed RGBD images, we compare the accuracy of reconstructing the scene using VolSDF\(^{\dagger}\) and NeuS\(^{\dagger}\) against NeuralRGBD [Azinovic et al. 2022] and BundleFusion [Dai et al. 2017]. We report normalized chamfer distances (lower is better) across four synthetic scenes from BlendSwap.com. We use the pipeline described in Section 4.1 to estimate the camera poses. For posed synthetic RGBD sequences, we compare the accuracy of reconstruction between VolSDF\(^{\dagger}\) and PointSLAM [Sandström et al. 2023]. We report the mean \(L_1\) distances in cm (lower is better) across eight synthetic scenes from the Replica dataset. For our experiments, we use 15 temporally equidistant poses out of the 2000 RGBD images per scene in the dataset and train VolSDF\(^{\dagger}\) for 1500 iterations (\( \sim \) 21K gradient steps). For posed images, our errors on this simulated dataset closely reflect the performance of [Gropp et al. 2020] for approximating surfaces from noise-free, oriented point clouds. We report an average improvement of 81% over PointSLAM.
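For reference, the chamfer metric is the standard symmetric nearest-neighbour distance between point sets; a minimal sketch is below. The normalization shown (by the ground-truth bounding-box diagonal) is an illustrative convention, not necessarily the one used in our tables.

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(pts_pred, pts_gt):
    """Symmetric chamfer distance between two (N, 3) point sets."""
    d_pg, _ = cKDTree(pts_gt).query(pts_pred)   # nearest GT point for each predicted point
    d_gp, _ = cKDTree(pts_pred).query(pts_gt)   # nearest predicted point for each GT point
    return d_pg.mean() + d_gp.mean()

def normalized_chamfer(pts_pred, pts_gt):
    """Chamfer distance divided by the GT bounding-box diagonal (illustrative convention)."""
    diag = np.linalg.norm(pts_gt.max(axis=0) - pts_gt.min(axis=0))
    return chamfer_distance(pts_pred, pts_gt) / diag
```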

Collecting data with a manipulator versus a turn-table

At its core, our method integrates a set of posed RGBD images to train a neural 3D model. As implemented, we assume the scene is static and the rig moves; however, our pipeline can accommodate a turn-table. Note that commercial turn-table scanners integrate the data using a combination of the encoders on the turn-table and point cloud registration, and their camera-projector pair is tuned to yield dense scene points at close distances. Our stereo backend has a limit on the disparity range and will break at those close working distances. The lowest camera-object distance we can accommodate is 400 mm with a 16 mm lens (see the "Reconstructing a very shiny object" demo above), where we struggle because the learnt stereo over-smooths the disparities.

To incorporate a turn-table into our pipeline, we remove visual features from the background and force our hybrid RGBD pose estimation system (Sec. 4.1) to focus on the foreground using depth-based masking. The depth masks are not perfect; as a result, the pose estimates for data collected with a turn-table are worse than when the robot is moving.

In the example below we take 6 RGBD images: 3 stereo pairs for the moving camera, and 6 RGBD captures (left channel of the stereo pair) with the rotating turn-table. We train on 5 images and test on 1, and use AdaShell\(^\dagger\) for this demo. The quality of reconstruction suffers a bit with the turn-table, but more importantly, the current pose estimation pipeline breaks before the turn-table can rotate the object significantly. At the moment this is an engineering problem, which we believe is solvable by designing our own robot-mounted camera-projector system; that is outside the scope of this work.
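A sketch of the depth-based masking idea is below. ORB is used only as a stand-in feature detector (our actual front-end is the hybrid RGBD pose estimator of Sec. 4.1), and the depth threshold is an illustrative value that depends on the turn-table geometry.

```python
import cv2
import numpy as np

def foreground_features(gray, depth_mm, max_depth_mm=700, erode_px=5):
    """Detect features only on the turn-table foreground using a depth-based mask.
    The 700 mm threshold is illustrative; missing depth leaves holes in the mask."""
    mask = ((depth_mm > 0) & (depth_mm < max_depth_mm)).astype(np.uint8) * 255
    mask = cv2.erode(mask, np.ones((erode_px, erode_px), np.uint8))  # drop boundary pixels
    orb = cv2.ORB_create(nfeatures=2000)
    keypoints, descriptors = orb.detectAndCompute(gray, mask)
    return keypoints, descriptors
```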

Camera on robot (3 stereo pairs)

Ground Truth vs reconstruction

Object on turntable (6 poses)

Ground Truth vs reconstruction

Viewpoint diversity or depth is required to train representations.

In the absence of depth, radiance-based methods need viewpoint diversity to disambiguate between appearance and geometry. Additionally, monocular supervision is very dependent on the scene-network combination. We incorporate metric depth from stereo to capture appearance and geometry. Below we show the result of training on the ReNe reflective data sequence. ReLight My NeRF [Toschi et al. CVPR '23] has low viewpoint diversity and does not supply metric depth. Our experiment below tries to train a radiance model (architecture v5 from Toschi et al.). With the given description and hyperparameters, our re-implementation of architecture v5 was unable to achieve the reported reconstruction quality. Additionally, monocular methods (as recommended by MonoSDF [Yu et al. '22]) did not generate meaningful supervision on this sequence. As a result, although we made some progress in view interpolation, the geometry recovered in the process is meaningless.

ReLightMyNeRF training images

Reconstructed radiance (PSNR 23.5)

Reconstructed shape


The source code and design of this webpage are adapted from the Ref-NeRF project page. We would like to thank the authors for the inspiration.