VEIGAR: View-consistent Explicit Inpainting and Geometry Alignment for 3D object Removal

NeurIPS 2025

An illustration of the VEIGAR inpainting and reconstruction pipeline. VEIGAR inpaints a single anchor view and propagates edits to other views using deep stereo depth-based projection, followed by completion and reconstruction guided by a scale-invariant depth loss. Compared to prior methods like SPIn-NeRF, VEIGAR produces sharper and more view-consistent results.

Abstract

Recent advances in Novel View Synthesis (NVS) and 3D generation have significantly improved editing tasks, with a primary emphasis on maintaining cross-view consistency throughout the generative process. Contemporary methods typically address this challenge using a dual-strategy framework: (1) performing consistent 2D inpainting across all views, guided by priors embedded either explicitly in pixel space or implicitly in latent space; and (2) conducting 3D reconstruction with additional consistency guidance. These methods often require an initial 3D reconstruction phase to establish geometric structure, which introduces considerable computational overhead, and even with this added cost the resulting reconstruction quality often remains suboptimal. In this paper, we present VEIGAR, a computationally efficient framework that outperforms existing methods without relying on an initial reconstruction phase. VEIGAR leverages a lightweight foundation model to reliably align priors explicitly in pixel space. In addition, we introduce a novel supervision strategy based on a scale-invariant depth loss, which removes the need for the traditional scale-and-shift operations in monocular depth regularization. In extensive experiments, VEIGAR sets a new state of the art in reconstruction quality and cross-view consistency while reducing training time threefold compared to the fastest existing method.

Overview



Pipeline overview. The first stage (a) performs stereo depth completion on an anchor view and estimates implicit intrinsics to enable accurate projection of inpainted content to other views. The second stage (b) uses a pretrained inpainting network to complete the masked regions in each projected view. The resulting multi-view images are then used for 3D reconstruction via Gaussian Splatting, guided by scale-invariant and photometric losses to ensure high-fidelity and geometrically consistent outputs.
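To make the projection step in stage (a) concrete, the following is a minimal sketch of depth-based reprojection of a single anchor-view pixel into another view. It is not the authors' implementation: it assumes a pinhole camera, intrinsics shared by both views, and a known rigid transform from the anchor camera to the target camera. The function name `reproject_pixel` and all parameter names are hypothetical and chosen for illustration only.

```python
from typing import Optional, Sequence, Tuple


def reproject_pixel(
    u: float, v: float, depth: float,
    fx: float, fy: float, cx: float, cy: float,
    R: Sequence[Sequence[float]], t: Sequence[float],
) -> Optional[Tuple[float, float, float]]:
    """Map an anchor-view pixel with known depth into a target view.

    Assumptions (not taken from the source): pinhole model, both views share
    the intrinsics (fx, fy, cx, cy), and (R, t) maps anchor-camera coordinates
    to target-camera coordinates, with R given as a 3x3 row-major nested list.
    Returns (u', v', depth') in the target view, or None if the point falls
    behind the target camera.
    """
    # Back-project the pixel to a 3D point in the anchor camera frame.
    p = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)

    # Apply the rigid transform anchor -> target: q = R p + t.
    q = [sum(R[r][c] * p[c] for c in range(3)) + t[r] for r in range(3)]

    # Points at or behind the camera plane cannot be projected.
    if q[2] <= 0.0:
        return None

    # Perspective projection into the target image plane.
    return (fx * q[0] / q[2] + cx, fy * q[1] / q[2] + cy, q[2])


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    # Centre pixel at depth 2, target camera offset by 0.1 along x.
    print(reproject_pixel(320.0, 240.0, 2.0, 500.0, 500.0, 320.0, 240.0,
                          identity, [0.1, 0.0, 0.0]))
```

Applying this mapping to every inpainted pixel of the anchor view yields a partially filled target view. Pixels that receive no projected content remain masked, which is why a completion step such as stage (b) is still needed.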

Main Results

Comparison panels (left to right): Original, In-N-Out, GScream, Ours.

Ablation Results

Ablation panels (left to right): Ours w/o \( \mathcal{L}_{\text{SI}} \), Ours.
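This page does not state the exact form of the scale-invariant depth term \( \mathcal{L}_{\text{SI}} \) that the ablation removes. As a hedged illustration only, the sketch below uses the widely cited scale-invariant log-depth error of Eigen et al. (2014), which may differ from the loss used in VEIGAR. The function name `scale_invariant_depth_loss` and the weight `lam` are assumptions made for this example.

```python
import math
from typing import Sequence


def scale_invariant_depth_loss(
    pred: Sequence[float], target: Sequence[float], lam: float = 1.0
) -> float:
    """Scale-invariant log-depth error (Eigen et al., 2014 form).

    With d_i = log(pred_i) - log(target_i), the loss is
        mean(d_i^2) - lam * mean(d_i)^2.
    For lam = 1 this equals the variance of d_i, so multiplying every
    predicted depth by a constant leaves the loss unchanged. This is an
    illustrative stand-in, not necessarily the loss used by VEIGAR.
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and equal in length")
    if any(p <= 0.0 for p in pred) or any(t <= 0.0 for t in target):
        raise ValueError("depths must be strictly positive")

    # Per-pixel log-depth differences.
    diffs = [math.log(p) - math.log(t) for p, t in zip(pred, target)]
    n = len(diffs)
    mean = sum(diffs) / n
    mean_sq = sum(d * d for d in diffs) / n
    return mean_sq - lam * mean * mean


if __name__ == "__main__":
    rendered = [1.0, 2.0, 4.0]
    mono = [2.0, 4.0, 8.0]  # same structure, different global scale
    print(scale_invariant_depth_loss(rendered, mono))
```

Because a global scale factor only shifts every log difference by the same constant, a term of this kind can compare rendered depth with a monocular depth estimate without first fitting a scale and shift, which matches the motivation given in the abstract.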