DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by
Distilling Neural Fields and Foundation Model Features

NeurIPS 2024

Letian Wang^1,2, Seung Wook Kim¹, Jiawei Yang³, Cunjun Yu^1,4, Boris Ivanovic¹,
Steven Waslander², Yue Wang^1,3, Sanja Fidler^1,2, Marco Pavone^1,5, Peter Karkus¹,

¹NVIDIA Research, ²University of Toronto, ³University of Southern California,
⁴National University of Singapore, ⁵Stanford University

Paper Video Code

DistillNeRF is a generalizable model for 3D scene representation, self-supervised by natural sensor streams along with distillation from offline NeRFs and vision foundation models. It supports rendering RGB, depth, and foundation feature images, without test-time per-scene optimization, and enables downstream tasks such as zero-shot 3D semantic occupancy prediction and open-vocabulary text queries.

DistillNeRF model architecture. Left: single-view encoding with two-stage probabilistic depth prediction; Center: multi-view pooling into a sparse hierarchical voxel representation using sparse quantization and convolution; Right: volumetric rendering from sparse hierarchical voxels.

Capabilities

Given single-frame multi-view cameras as input and without test-time per-scene optimization, DistillNeRF can reconstruct RGB images (row 2), estimate depth (row 3), render foundation model features (rows 4, 5) which enables open-vocabulary text queries (rows 6, 7, 8), and predict binary and semantic occupancy in zero shot (rows 9, 10).

Novel-View Synthesis

Given single-frame multi-view cameras as input and without test-time per-scene optimization, we can synthesize novel views.

Generalizability

Trained on the nuScenes dataset, our model demonstrates strong zero-shot transfer performance on the unseen Waymo NOTR dataset, achieving decent reconstruction quality (row 2). This quality can be further enhanced by applying simple color alterations to account for camera-specific coloring discrepancies (row 3). After fine-tuning (row 4), our model surpasses the offline per-scene optimized EmerNeRF, achieving higher PSNR (29.84 vs. 28.87) and SSIM (0.911 vs. 0.814).

Comparisons

Our generalizable DistillNerf is on par with SOTA offline per-scene optimized NeRF method (EmerNerf), and significantly outperforms SOTA generalizable methods (UniPAD and SelfOcc).

Quantitatively, DistillNeRF significantly outperforms state-of-the-art generalizable NeRF methods, while additionally supporing foundation model feature lifting. Specifically, our model outperforms the best-performing prior method (SelfOcc) in PSNR by 45.6% and 14.0%, and in SSIM by 64.9% and 27.1%, for reconstruction and novel-view synthesis, respectively. Our model is also on par with the per-scene optimized method (EmerNeRF), while only requiring single image input and getting rid of per-scene optimization.

Comparing to SOTA offline NeRFs: While we distill offline per-scene optimized NeRF (e.g. EmerNeRF) into DistillNeRF, we observe qualitative phenomenon where the student (DistillNeRF) surpassing the teacher (EmerNeRF), as DistillNeRF is more robust to sensor calibration noises and able to generate smoother predictions.

Comparing to SOTA online NeRFs: UniPAD generate blurry RGB reconstruction and shows strong LiDAR-scanning patterns in the rendered depth. SelfOcc generates gray images and inconsistent depth prediction in the high region of the image.

Ablation Studies

When removing Distillation from offline NeRFs, we observe downgraded depth prediction accuracy especially in the high region of the image. When removing the parameterized space, the model is only able to predict depth in the limited range.

BibTeX

@misc{wang2024distillnerf, title={DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features}, author={Letian Wang and Seung Wook Kim and Jiawei Yang and Cunjun Yu and Boris Ivanovic and Steven L. Waslander and Yue Wang and Sanja Fidler and Marco Pavone and Peter Karkus}, year={2024}, eprint={2406.12095}, archivePrefix={arXiv}, primaryClass={id='cs.CV' full_name='Computer Vision and Pattern Recognition' is_active=True alt_name=None in_archive='cs' is_general=False description='Covers image processing, computer vision, pattern recognition, and scene understanding. Roughly includes material in ACM Subject Classes I.2.10, I.4, and I.5.'} }

This website is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

We thank the authors of Nerfies that kindly open sourced the template of this website.