LargeSpatialModel: Real-time Unposed Images to Semantic 3D

Zhiwen Fan1,2*, Jian Zhang3*, Wenyan Cong1, Peihao Wang1, Renjie Li4, Kairun Wen3, Shijie Zhou5, Achuta Kadambi5, Zhangyang Wang1, Danfei Xu2,6, Boris Ivanovic2, Marco Pavone2,7, Yue Wang2,8
1UT Austin  2NVIDIA Research  3Xiamen University  4TAMU
5UCLA  6GaTech  7Stanford University  8USC
NeurIPS 2024
TL;DR: LSM takes two unposed and uncalibrated images as input and reconstructs an explicit radiance field, encompassing geometry, appearance, and semantics, in real time.

Demo



Abstract

A classical problem in computer vision is to reconstruct and understand 3D structure from a limited number of images, accurately recovering geometry, appearance, and semantics. Traditional approaches typically decompose this objective into multiple subtasks, involving several stages of complicated mapping among different data representations. For instance, dense reconstruction through Structure-from-Motion (SfM) requires transforming a set of multi-view images into keypoints and camera parameters before estimating structure. 3D understanding relies on a lengthy reconstruction pipeline before the results can be fed into data- and task-specific neural networks. This paradigm results in extensive processing time and substantial engineering effort for every new scene.


In this work, we introduce the Large Spatial Model (LSM), a point-based model that processes unposed RGB images directly into semantic 3D. LSM simultaneously infers geometry, appearance, and semantics within a scene and synthesizes versatile label maps at novel views, all in a single feed-forward pass. To represent the scene, we employ a generic Transformer-based framework that integrates global geometry through pixel-aligned point maps. To facilitate the regression of scene attributes, we adopt local context aggregation with multi-scale fusion, tailored for enhanced prediction accuracy. To address the scarcity of labeled 3D semantic data and to enable scene manipulation via natural language, we lift features from a well-trained 2D model into a 3D-consistent semantic feature field. An efficient decoder parameterizes a set of anisotropic Gaussians, enabling the rendering of scene attributes at novel views. End-to-end supervised learning and comprehensive experiments on various tasks demonstrate that LSM unifies multiple 3D vision tasks, achieving real-time reconstruction and rendering while outperforming state-of-the-art baselines.
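To make the end-to-end supervision concrete, the sketch below shows one plausible form of the training objective: rendered RGB is compared against the ground-truth novel view, while rendered semantic features are distilled from a frozen, pre-trained 2D feature extractor. The function name, tensor shapes, and the weight `w_sem` are illustrative assumptions, not the paper's actual implementation.

```python
import torch.nn.functional as F

def lsm_training_loss(rendered_rgb, gt_rgb, rendered_feat, teacher_feat, w_sem=0.5):
    """rendered_rgb, gt_rgb: (B, 3, H, W); rendered_feat, teacher_feat: (B, C, H, W).

    `teacher_feat` would come from a frozen, pre-trained 2D feature extractor run on
    the ground-truth view; `w_sem` is an assumed weighting, not a value from the paper.
    """
    # Photometric term: rendered color vs. ground-truth novel view.
    photometric = F.l1_loss(rendered_rgb, gt_rgb)
    # Distillation term: align rendered features with the frozen 2D model's features,
    # lifting 2D semantics into a 3D-consistent feature field.
    semantic = 1.0 - F.cosine_similarity(rendered_feat, teacher_feat, dim=1).mean()
    return photometric + w_sem * semantic
```

Here `rendered_feat` would be produced by rasterizing the per-Gaussian semantic features at the held-out view, so gradients flow back through the whole feed-forward model.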


Method Overview


Our method takes input images from which pixel-aligned point maps are regressed by a generic Transformer. Point-based scene parameters are then predicted by a second Transformer that performs local context aggregation and hierarchical fusion. The model lifts pre-trained 2D features into a consistent 3D feature field. It is supervised end-to-end, minimizing losses between rasterized attribute and feature maps and the ground truth at novel views, as sketched below. During inference, our approach predicts the scene representation without requiring camera parameters, enabling real-time semantic 3D reconstruction.
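As a rough illustration of this pipeline, the following PyTorch sketch wires together stand-ins for the two stages: a generic Transformer that regresses pixel-aligned point maps from two unposed views, and a point-wise head that predicts anisotropic Gaussian parameters plus a semantic feature per point. All module names, layer sizes, and output parameterizations are assumptions for illustration; the actual architecture (including local context aggregation and hierarchical fusion) is more involved.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointMapBackbone(nn.Module):
    """Stand-in for the generic Transformer that regresses pixel-aligned point maps."""
    def __init__(self, dim=256):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # crude patch embedding
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.point_head = nn.Linear(dim, 3)  # one 3D point per patch, in a shared frame

    def forward(self, images):                               # images: (B, V, 3, H, W), V = 2 views
        B, V, _, H, W = images.shape
        tokens = self.patch_embed(images.flatten(0, 1))      # (B*V, dim, H/16, W/16)
        tokens = tokens.flatten(2).transpose(1, 2)           # (B*V, P, dim)
        feats = self.blocks(tokens)                          # per-patch features
        points = self.point_head(feats)                      # (B*V, P, 3) pixel-aligned point map
        return points.reshape(B, -1, 3), feats.reshape(B, -1, feats.shape[-1])

class GaussianHead(nn.Module):
    """Stand-in for the point-wise decoder: anisotropic Gaussian parameters plus a
    semantic feature per point (local context aggregation and fusion are omitted)."""
    def __init__(self, dim=256, sem_dim=64):
        super().__init__()
        self.sem_dim = sem_dim
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                 nn.Linear(dim, 3 + 4 + 1 + 3 + sem_dim))

    def forward(self, points, feats):
        out = self.mlp(feats)
        scale, rot, opacity, rgb, sem = out.split([3, 4, 1, 3, self.sem_dim], dim=-1)
        return {
            "means": points,                        # Gaussian centers from the point map
            "scales": scale.exp(),                  # anisotropic scales
            "rotations": F.normalize(rot, dim=-1),  # unit quaternions
            "opacities": opacity.sigmoid(),
            "colors": rgb.sigmoid(),
            "semantics": sem,                       # per-point semantic feature
        }

# Usage: two unposed, uncalibrated RGB views; the resulting dict can be passed to a
# differentiable Gaussian rasterizer to render color, depth, and feature maps at novel views.
images = torch.rand(1, 2, 3, 256, 256)
points, feats = PointMapBackbone()(images)
gaussians = GaussianHead()(points, feats)
```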

Results