DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by
Distilling Neural Fields and Foundation Model Features

NeurIPS 2024
1NVIDIA Research, 2University of Toronto, 3University of Southern California,
4National University of Singapore, 5Stanford University

DistillNeRF is a generalizable model for 3D scene representation, self-supervised by natural sensor streams along with distillation from offline NeRFs and vision foundation models. It supports rendering RGB, depth, and foundation feature images, without test-time per-scene optimization, and enables downstream tasks such as zero-shot 3D semantic occupancy prediction and open-vocabulary text queries.

DistillNeRF model architecture. Left: single-view encoding with two-stage probabilistic depth prediction; Center: multi-view pooling into a sparse hierarchical voxel representation using sparse quantization and convolution; Right: volumetric rendering from sparse hierarchical voxels.
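
The rendering stage on the right of the figure can be illustrated with the standard NeRF compositing rule, in which samples along a camera ray are alpha-composited using their densities. The sketch below assumes that standard quadrature for a single ray. The function name `composite_ray` and its arguments are illustrative and are not taken from the DistillNeRF code, which additionally draws its samples from the sparse hierarchical voxels.

```python
import math


def composite_ray(densities, values, deltas):
    """Alpha-composite per-sample values along one ray (standard NeRF quadrature).

    densities: non-negative volume density at each sample
    values:    quantity to render at each sample (one color channel,
               one feature channel, or the sample distance for depth)
    deltas:    spacing between consecutive samples along the ray
    """
    transmittance = 1.0  # fraction of light that has not been absorbed yet
    rendered = 0.0
    weights = []
    for sigma, value, delta in zip(densities, values, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha          # contribution of this sample
        rendered += weight * value
        weights.append(weight)
        transmittance *= 1.0 - alpha
    return rendered, weights
```

Passing sample distances as `values` yields an expected depth for the ray, while passing a color or feature channel yields the corresponding rendered pixel value. This is why one set of weights can serve RGB, depth, and feature rendering.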

Capabilities

Given single-frame multi-view camera images as input and without test-time per-scene optimization, DistillNeRF can reconstruct RGB images (row 2), estimate depth (row 3), render foundation model features (rows 4, 5), which enable open-vocabulary text queries (rows 6, 7, 8), and predict binary and semantic occupancy zero-shot (rows 9, 10).
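
One common way to run an open-vocabulary text query against a rendered feature image is to compare each pixel's feature vector with a text embedding by cosine similarity. The sketch below assumes that the rendered features and the text embedding live in the same space (for example, a CLIP-style joint embedding). The helper names `cosine_similarity` and `query_feature_image` are hypothetical, and the exact query procedure used by DistillNeRF is not described on this page.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def query_feature_image(feature_image, text_embedding):
    """Return an H x W similarity map for an H x W grid of feature vectors."""
    return [
        [cosine_similarity(pixel, text_embedding) for pixel in row]
        for row in feature_image
    ]
```

Thresholding or color-mapping the resulting similarity map gives a per-pixel response to the text prompt.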

Novel-View Synthesis

Given single-frame multi-view cameras as input and without test-time per-scene optimization, we can synthesize novel views.

Generalizability

Trained on the nuScenes dataset, our model demonstrates strong zero-shot transfer performance on the unseen Waymo NOTR dataset, achieving decent reconstruction quality (row 2). This quality can be further enhanced by applying simple color alterations to account for camera-specific coloring discrepancies (row 3). After fine-tuning (row 4), our model surpasses the offline per-scene optimized EmerNeRF, achieving higher PSNR (29.84 vs. 28.87) and SSIM (0.911 vs. 0.814).
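
For readers unfamiliar with the PSNR metric quoted above, the sketch below computes it for a single image using the standard definition. The function name `psnr`, the flat pixel lists, and the assumption that intensities lie in [0, 1] are illustrative choices and are not the evaluation code behind the reported numbers.

```python
import math


def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(rendered) != len(reference) or not rendered:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - g) ** 2 for r, g in zip(rendered, reference)) / len(rendered)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because PSNR is logarithmic in the mean squared error, a gap of about 1 dB on a single image corresponds to roughly 20% lower mean squared error.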

Comparisons

Our generalizable DistillNeRF is on par with the SOTA offline per-scene optimized NeRF method (EmerNeRF) and significantly outperforms SOTA generalizable methods (UniPAD and SelfOcc).

Comparison with SOTA offline NeRFs: Although we distill offline per-scene optimized NeRFs (e.g., EmerNeRF) into DistillNeRF, we qualitatively observe cases where the student (DistillNeRF) surpasses the teacher (EmerNeRF): DistillNeRF is more robust to sensor calibration noise and produces smoother predictions.

Comparison with SOTA online NeRFs: UniPAD generates blurry RGB reconstructions and shows strong LiDAR-scanning patterns in the rendered depth. SelfOcc generates gray images and inconsistent depth predictions in the upper region of the image.

Ablation Studies

When distillation from offline NeRFs is removed, depth prediction accuracy degrades, especially in the upper region of the image. When the parameterized space is removed, the model can only predict depth within a limited range.
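
To give an intuition for why a parameterized space extends the depth range, the sketch below shows one widely used parameterization, the scene contraction from mip-NeRF 360, which leaves points inside the unit ball unchanged and squeezes all farther points into a shell of radius 2. This is an assumption chosen for illustration: the page does not state the exact parameterization DistillNeRF uses, and the function name `contract` is hypothetical.

```python
import math


def contract(point):
    """mip-NeRF 360 style contraction of a 3D point (illustrative only).

    Points with norm <= 1 are returned unchanged; points farther away are
    mapped to norm 2 - 1/norm, so unbounded space fits inside radius 2.
    """
    norm = math.sqrt(sum(c * c for c in point))
    if norm <= 1.0:
        return list(point)
    scale = (2.0 - 1.0 / norm) / norm
    return [c * scale for c in point]
```

With a mapping of this kind, a fixed-size voxel grid can cover nearby geometry at full resolution while still representing distant content, instead of being cut off at a maximum range.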

   
     

BibTeX

     
@misc{wang2024distillnerf,
      title={DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features}, 
      author={Letian Wang and Seung Wook Kim and Jiawei Yang and Cunjun Yu and Boris Ivanovic and Steven L. Waslander and Yue Wang and Sanja Fidler and Marco Pavone and Peter Karkus},
      year={2024},
      eprint={2406.12095},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}