Towards Scalable and Consistent 3D Editing

TL;DR: We introduce 3DEditVerse, the largest paired 3D editing benchmark, and propose 3DEditFormer, a mask-free transformer enabling precise, consistent, and scalable 3D edits.

3D editing, the task of locally modifying the geometry or appearance of a 3D asset, has wide applications in immersive content creation, digital entertainment, and AR/VR. However, unlike 2D editing, it remains challenging because it requires cross-view consistency, structural fidelity, and fine-grained controllability. Existing approaches are often slow, prone to geometric distortions, or dependent on accurate, manually specified 3D masks, which are error-prone and impractical to obtain. To address these challenges, we advance both the data and model fronts. On the data side, we introduce 3DEditVerse, the largest paired 3D editing benchmark to date, comprising 116,309 high-quality training pairs and 1,500 curated test pairs. Built through complementary pipelines of pose-driven geometric edits, foundation model-guided appearance edits, and human validation, 3DEditVerse ensures edit locality, multi-view consistency, and semantic alignment. On the model side, we propose 3DEditFormer, a 3D-structure-preserving conditional transformer. By enhancing image-to-3D generation with dual-guidance attention and time-adaptive gating, 3DEditFormer disentangles editable regions from preserved structure, enabling precise and consistent edits without requiring auxiliary 3D masks. Extensive experiments demonstrate that our framework outperforms state-of-the-art baselines both quantitatively and qualitatively, establishing a new standard for practical and scalable 3D editing. Dataset and code will be released.

3DEditFormer | Real-World Scanned Assets Editing

* From left to right: (1) Source Voxel (Input), (2) Source 3D (Input), (3) Target Image (Input), (4) Edited Voxel (Output), (5) Edited 3D (Output).

Editing prompt: Add a thick, branched crown of florets extending from the top of the asparagus spear.

Editing prompt: Remove the two brown arm-like protrusions on the sides of the backpack.

Editing prompt: Add a large, rigid rectangular compartment extending from the front of the backpack, flush with its main body.

Editing prompt: Replace the two red circular eyes with elongated oval indentations oriented horizontally.

Editing prompt: Replace the rounded dome-shaped top with a straight, cylindrical neck extension.

Editing prompt: Add a pair of sturdy, rectangular side handles protruding horizontally from the middle of the long sides.

Editing prompt: Replace the flat rectangular brush head with a rounded, dome-shaped head of equal width.

Editing prompt: Replace the curved, continuous backrest with a straight, rectangular panel of the same height.

3DEditFormer | Different View Editing

* From left to right: (1) Source Image, (2) Source Voxel (Input), (3) Source 3D (Input), (4) Target Image (Input), (5) Edited Voxel (Output), (6) Edited 3D (Output).

3DEditFormer (Ours)

VoxHammer

Editing prompt: Add a small, flat bed behind the cab.

3DEditFormer (Ours)

VoxHammer

Editing prompt: Add a bell to the cow's neck.

Comparison between lit and unlit rendering modes

Darkened regions appear under physically-based rendering due to the Cycles path-tracing engine (first and third rows).

3DEditVerse Dataset

* Generative data from text-guided editing is shown in the video.

3DEditVerse is the largest paired 3D editing benchmark to date, comprising 116,309 high-quality training pairs and 1,500 curated test pairs. Built through complementary pipelines of pose-driven geometric edits, foundation model-guided appearance edits, and human validation, it ensures edit locality, multi-view consistency, and semantic alignment.
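
As an illustration of what "edit locality" can mean for a source/edited pair, the sketch below compares two voxel occupancy sets and reports how much of the shape is left untouched. This is not the benchmark's validation procedure or metric; the function names (`changed_voxels`, `preserved_ratio`) and the intersection-over-union score are assumptions chosen only to make the idea concrete.

```python
from typing import Set, Tuple

Voxel = Tuple[int, int, int]  # integer grid coordinate of an occupied voxel


def changed_voxels(source: Set[Voxel], edited: Set[Voxel]) -> Set[Voxel]:
    """Voxels that are occupied in exactly one of the two shapes."""
    return source ^ edited


def preserved_ratio(source: Set[Voxel], edited: Set[Voxel]) -> float:
    """Intersection-over-union of the two occupancy sets.

    A value near 1.0 means the edit touched only a small part of the shape
    (hypothetical locality score, not the benchmark's own metric).
    """
    union = source | edited
    if not union:
        return 1.0  # two empty shapes are trivially identical
    return len(source & edited) / len(union)


# Toy pair: a 4x4x4 block, and the same block with a 2x2 bump added on top.
source = {(x, y, z) for x in range(4) for y in range(4) for z in range(4)}
bump = {(x, y, 4) for x in range(2) for y in range(2)}
edited = source | bump

print(len(changed_voxels(source, edited)))
print(round(preserved_ratio(source, edited), 4))
```

A local edit such as the added bump changes few voxels and keeps the ratio close to 1, whereas an edit that distorts the whole asset would drive it down.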

3DEditFormer | Comparison with SoTA VoxHammer


3DEditFormer is a 3D-structure-preserving conditional transformer. By enhancing image-to-3D generation with dual-guidance attention and time-adaptive gating, it disentangles editable regions from preserved structure, enabling precise and consistent edits without requiring auxiliary 3D masks.
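
To give a feel for how two guidance signals can be blended by a timestep-dependent gate, here is a minimal, dependency-free sketch. It is not the 3DEditFormer implementation: the two guidance streams (`structure_feats`, `edit_feats`), the sigmoid gate with a fixed slope and midpoint, the direction in which the gate shifts over time, and the residual sum are all assumptions made for illustration. In a trained model the gate would be learned rather than fixed.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, context: Matrix) -> Matrix:
    """Single-head scaled dot-product attention; context acts as keys and values."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in context])
        out.append([sum(w * v[d] for w, v in zip(weights, context))
                    for d in range(len(q))])
    return out


def time_gate(t: float, slope: float = 8.0, midpoint: float = 0.5) -> float:
    """Hypothetical gate: a sigmoid of the normalized timestep t in [0, 1]."""
    return 1.0 / (1.0 + math.exp(-slope * (t - midpoint)))


def dual_guidance_block(tokens: Matrix, structure_feats: Matrix,
                        edit_feats: Matrix, t: float) -> Matrix:
    """Residual update mixing two attention streams with weights g and 1 - g."""
    g = time_gate(t)
    from_structure = cross_attention(tokens, structure_feats)
    from_edit = cross_attention(tokens, edit_feats)
    return [[x + g * s + (1.0 - g) * e for x, s, e in zip(xr, sr, er)]
            for xr, sr, er in zip(tokens, from_structure, from_edit)]


# Toy inputs: one token and one feature vector per guidance stream.
tokens = [[0.0, 0.0]]
structure_feats = [[1.0, 0.0]]
edit_feats = [[0.0, 1.0]]
for t in (0.0, 0.5, 1.0):
    print(t, dual_guidance_block(tokens, structure_feats, edit_feats, t))
```

With a single context vector per stream, attention returns that vector unchanged, so the printed output directly shows the gate shifting weight from one stream to the other as `t` increases.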

3DEditFormer | Non-Human Pose Editing

* From left to right: (1) Source Voxel (Input), (2) Source 3D (Input), (3) Target Image (Input), (4) Edited Voxel (Output), (5) Edited 3D (Output).

Video 1. Raise the deer's head.

Video 2. Make the eagle fly.

Video 3. Raise the monkey's left hand.

Although our 3DEditFormer is trained on human-pose editing data, it can generalize naturally to non-human pose editing.

3DEditFormer | Deformation-like Editing

* From left to right: (1) Source Voxel (Input), (2) Source 3D (Input), (3) Target Image (Input), (4) Edited Voxel (Output), (5) Edited 3D (Output).

Video 4. Make the dog's limbs shorter and its body plumper.

Video 5. Make the tree grow straight.

Our 3DEditFormer can produce coherent structural adaptations for deformation-like edits, indicating that the model is not limited to part manipulation but can also accommodate smooth geometric transformations.

3DEditFormer | Real-Object Editing

* From left to right: (1) Source Voxel (Input), (2) Source 3D (Input), (3) Target Image (Input), (4) Edited Voxel (Output), (5) Edited 3D (Output).

Video 6. Replace the bathtub with a sink.

Video 7. Replace the ice cream with a watermelon base.

Video 8. Replace the crab with an octopus.

Video 9. Replace the arm with a human hand holding a gun.

Video 10. Replace the sci-fi turret with a dragon head.

Our 3DEditFormer can generalize well to real-object editing.

The website template is borrowed from TRELLIS.